PREFACE



On May 27-31, 1985, a series of symposia was held at The University 
of Western Ontario, London, Canada, to celebrate the 70th birthday of Pro- 
fessor V. M. Joshi. These symposia were chosen to reflect Professor Joshi’s 
research interests as well as areas of expertise in statistical science among 
faculty in the Departments of Statistical and Actuarial Sciences, Economics, 
Epidemiology and Biostatistics, and Philosophy. 

From these symposia, the six volumes which comprise the “Joshi 
Festschrift” have arisen. The 117 articles in this work reflect the broad 
interests and high quality of research of those who attended our conference. 
We would like to thank all of the contributors for their superb cooperation 
in helping us to complete this project. 

Our deepest gratitude must go to the three people who have spent so 
much of their time in the past year typing these volumes: Jackie Bell, Lise 
Constant, and Sandy Tamowski. This work has been printed from “camera 
ready” copy produced by our Vax 785 computer and QMS Lasergraphix 
printers, using the text processing software TEX. At the initiation of this 
project, we were neophytes in the use of this system. Thank you, Jackie, Lise, 
and Sandy, for having the persistence and dedication needed to complete this 
undertaking. 

We would also like to thank Maria Hlawka-Lavdas, our systems analyst, 
for her aid in the layout design of the papers and for resolving the many 
difficult technical problems which were encountered. Nancy Nuzum and Elly 
Pakalnis have also provided much needed aid in the conference arrangements 
and in handling the correspondence for the Festschrift. 

Professor Robert Butts, the Managing Editor of The University of Western
Ontario Series in Philosophy of Science, has provided us with his advice
and encouragement. We are confident that the high calibre of the papers in 
these volumes justifies his faith in our project. 

In a Festschrift of this size, a large number of referees were needed. 
Rather than trying to list all of the individuals involved, we will simply say 
“thank you” to the many people who undertook this very necessary task for 
us. Your contributions are greatly appreciated. 

Financial support for the symposia and Festschrift was provided by The 
University of Western Ontario Foundation, Inc., The University of Western 
Ontario and its Faculties of Arts, Science, and Social Science, The UWO 
Statistical Laboratory, and a conference grant from the Natural Sciences 
and Engineering Research Council of Canada. Their support is gratefully
acknowledged. 

Finally, we would like to thank Professor Joshi for allowing us to hold the 
conference and produce this Festschrift in his honor. Professor Joshi is a very 
modest man who has never sought the limelight. However, his substantial 
contributions to statistics merit notice (see Volume I for a bibliography of 
his papers and a very spiffy photo). We hope he will accept this as a tribute 
to a man of the highest integrity. 




INTRODUCTION TO VOLUME III 



As recently as two decades ago, the methodology of time series con- 
sisted largely of spectral analysis and standard regression. These methods 
have continued to be important, but many new models, together with their 
concomitant analyses, have since emerged. This has resulted both in a large 
and continuing expansion of the time series “community”, and in a bur- 
geoning of the scope of application; a major use of time series methods is 
the analysis of economic data. The twenty-five articles which comprise this 
volume and which discuss both old and new models are organized so that 
papers on time series appear at the beginning and are followed by those on 
econometric models. 

The large expansion of the collection of available models has introduced 
a new and vexing problem for time series and econometric modellers, namely, 
that of model selection. Although this problem has been approached from a 
number of directions, much attention has been given recently to the study 
of criteria, such as the AIC, which can be used to compare models of widely 
differing character. The first four papers in this volume, by Akaike, Duong, 
Hannan and Terasvirta, deal with this problem of model selection. 

The level of complexity of a model useful for describing a random mech- 
anism is related to the extent of the generated data set: a small amount of 
data demands the property of simplicity in a model, whereas a large amount 
permits consideration of more complexity. How the statistician is to make 
decisions regarding the appropriate level of complexity is the subject of the 
paper by Hannan; the model selection procedure he discusses for linear sys- 
tems involves the use of criteria such as the AIC. Akaike also considers the 
problem of model selection, particularly as it applies to the selection of mod- 
els which demonstrate practical usefulness as opposed to similarity to the 
mechanism which generated the data. Duong discusses the application of 
subset selection procedures to model selection criteria with the aim of ob- 
taining a small subset of models containing the true model with specified 
high probability; this brings to model selection the notion of the confidence 
interval. Terasvirta discusses smoothing restrictions in regression and provides
generalizations of model selection criteria for use in choosing smoothing
parameters. 

Related to the concept of model selection is that of model adequacy, 
different aspects of which are discussed by Abraham and Kedem. Kedem 
discusses a goodness-of-fit test based on a small number of higher order
crossings, whereas Abraham considers the problem of detection and rejection
of outliers.

Inferential problems are considered in another set of papers. Hui and 
Li discuss the use of both shrinkage and empirical Bayes estimators for pa- 
rameters of moving average models of multi-item inventory systems. Kheoh 
and McLeod establish the low efficiency of certain consistent estimators in 
ARMA model estimation. Sutradhar, MacNeill and Sahrmann discuss the 
concept of time-series valued experimental designs, and propose tests that 
fit into the standard ANOVA paradigm. Space-time models are formulated 
and discussed by Aroian. 

Many statistical agencies, including Statistics Canada, use concurrent 
seasonal adjustment procedures to reduce the size of revisions as new data 
are accumulated. By applying various measures of filter revision to the 
frequency response function of seasonal adjustment filters, Dagum explores 
the problem of how often the concurrent seasonal adjustment filter of X-11- 
ARIMA should be revised. 

Fourier methods form the focus of another set of papers. Stoffer and 
Panchalingam survey Walsh-Fourier spectral methods and provide sequency
domain analyses of binary data. Jensen and Mansinha discuss the prop- 
erties of self-similar fractal stochastic processes and explore their use in 
modelling flicker noise processes that arise in geophysics. Feuerverger discusses
properties of the empirical characteristic function and applies them
to nonparametric testing for independence in multivariate data. 

The second half of this volume deals with topics of importance to econo- 
metrics. Granger, in a time series paper, points out that many macro- 
economic variables have a typical spectral shape, a shape consistent with 
the property that one difference will produce a stationary series. This pa- 
per considers a number of other models which generate series having typical 
spectral shape. 

Significant impetus for development of certain areas of statistical infer- 
ence comes from econometrics. This is illustrated by the papers of Phillips; 
Singh, Ullah and Carter; Vinod; and Zinde-Walsh and Ullah. Although this
work has particular relevance for econometrics, its usefulness extends into 
surrounding areas of statistics. 

Phillips has introduced fractional matrix calculus as a new tool for the 
study of multivariate distributions useful in econometrics. The methodol- 
ogy provided in this paper unifies the theories for finite sample and asymp- 
totic distributions. Robustness of tests in regression models is the subject 
of the paper by Zinde-Walsh and Ullah. They consider problems of numerical
and inferential robustness and provide bounds on critical values of
various statistics that guarantee robustness of test conclusions. Distributional
assumptions in econometrics often may be questionable. It is possible
to avoid problems caused by such misspecification by using nonparametric
techniques. Singh, Ullah and Carter present nonparametric estimation procedures
for multivariate densities and apply them to several econometrics
problems. Vinod, as well as presenting a comprehensive review of available
alternatives for constructing confidence intervals for ridge regression parameters,
also discusses a new alternative.

Models of particular interest to econometrics are central to the papers 
of Dufour, Maasoumi, Peters and Power. Rational expectation models, introduced
a decade ago, are now widely studied by econometricians. For such
models, Power discusses asymptotic properties of single equation errors-in-variables
estimators. Inference about the vector of covariances between the
stochastic explanatory variables and the error term of a structural equation
is the subject discussed by Dufour. Peters investigates the finite sample moments
of ordinary least squares estimators for a simple dynamic model when
the error term is small. Maasoumi develops small sample approximations
to the moments of the three-stage least squares reduced form estimator in a
general linear simultaneous equation model.

A principal use of regression is forecasting, with forecast uncertainty 
generally assessed in terms of a normality assumption which may not hold 
true. To circumvent this assumption, Veall applies the bootstrap to the 
problem of estimation of the probability distribution of the forecast error. 

The use of the mean squared-error of forecast in testing for structural 
shift in parameters is investigated by Tsurumi; sampling and Bayesian dis- 
tributions for the statistic are derived. 

The contents of the twenty-five papers in this volume, encapsulated 
above, reveal something of the panoramic breadth of applications and 
methodology of time series analysis and econometrics. The importance of 
the content, and the care taken in presentation of this content, make it the 
editors' pleasure to thank the authors of the articles in this volume. We
expect all practitioners and researchers in time series and econometric mod- 
elling will find valuable the new results presented in these papers, and will 
appreciate the efforts of the authors in carrying out the always difficult task 
of reviewing and summarizing the body of knowledge in the various fields 
covered in this volume. 
APPROXIMATION OF LINEAR SYSTEMS
1. INTRODUCTION 

The classical paradigm of statistics assumes that data is generated by a 
stochastic process whose structure is entirely known save for a fixed number 
of parameters. Of course there are many departures from this assumption 
and consequent statistical methods are useful over a much wider range than 
the paradigm suggests. In time series analysis the paradigm is rarely relevant,
which partly explains the wide use of Fourier methods, which are
non-parametric. The other major part of time series analysis is that based
on autoregressive-moving average (ARMA) models. Much of the literature
associated with these acts as if the data are actually generated by such a
model, though this attitude has been modified in some of the systems and
control literature (Rissanen, 1983) and in the work of Akaike and Shibata
(see Shibata, 1980). Here the point of view will also be taken that the
ARMA models are fitted as approximations. A complete treatment will be
impossible because of the space available and also the state of development
of the ideas.

¹ Department of Statistics, I.A.S., Australian National University, GPO Box 4,
Canberra, ACT 2601, Australia

I. B. MacNeill and G. J. Umphrey (eds.), Time Series and Econometric Modelling, 1-12.

In order to set the scene consider a vector stationary process

$$ y(t) = \sum_{j=0}^{\infty} K(j)\varepsilon(t-j), \quad K(0) = I_s, \quad \sum_{j=0}^{\infty} \|K(j)\|^2 < \infty, $$

$$ E\{\varepsilon(s)\varepsilon(t)'\} = \delta_{st}\Sigma, \quad \Sigma > 0. \tag{1.1} $$

It is assumed that $y(t)$, $\varepsilon(t)$ have $s$ components and that the
$\varepsilon(t)$ are the linear innovations, i.e. $\varepsilon(t) = y(t) - y(t \mid t-1)$
where $y(t \mid t-1)$ is the best linear predictor (in the least squares sense) of
$y(t)$ from $y(s)$, $s < t$. Then

$$ k(z) = \sum_{j=0}^{\infty} K(j)z^{-j} $$

is not only analytic for $|z| > 1$ but also $\det(k) \neq 0$, $|z| > 1$. Put

$$ \mathcal{H} = [K(i+j-1)]_{i,j=1,2,\ldots} $$

for the infinite Hankel matrix and let $H_0$ be composed of a minimal number,
$n$, of rows of $\mathcal{H}$ that span its row space (all of which rows are in
$\ell_2$). Of course $n = \infty$ is the standard case. Let $H_1$ be composed of the
first $s$ rows of $\mathcal{H}$ and $H_0 = [K, H_2]$ where $K$ consists of the first
$s$ columns of $H_0$. Put $\varepsilon_t' = [\varepsilon(t)', \varepsilon(t-1)', \ldots]$
and $x(t \mid t-1) = H_0\varepsilon_{t-1}$. Since $H_1$, $H_2$ are composed of rows of
$\mathcal{H}$ then $H_1 = HH_0$, $H_2 = FH_0$ for suitable $H$, $F$ and then

$$ x(t+1 \mid t) = Fx(t \mid t-1) + K\varepsilon(t), \qquad y(t) = Hx(t \mid t-1) + \varepsilon(t). $$

If $n$ is finite then $k(z) = H(zI_n - F)^{-1}K + I_s$, which is rational and,
conversely, if $k$ is rational then $n$ is finite. Here we propose to study
approximations to (1.1) in general (i.e. for $n = \infty$) by systems for which $n$,
chosen from the data $y(t)$, $t = 1, \ldots, T$, is finite. The integer $n$ is called
the order or McMillan degree so that we are concerned with approximating
a general system by one of finite order. We shall use $n_0$ for the true order
when emphasis is needed.
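
The finite-order case can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the function names and the particular matrices F, K, H are hypothetical and not taken from the text. It expands k(z) = H(zI_n - F)^{-1}K + I_s into the coefficients K(0) = I_s, K(j) = H F^{j-1} K, and checks that a truncated block Hankel matrix [K(i+j-1)] has numerical rank n.

```python
import numpy as np

def impulse_response(F, K, H, n_terms):
    """Return [K(0), ..., K(n_terms-1)] with K(0) = I_s and K(j) = H F^(j-1) K."""
    s = H.shape[0]
    coeffs = [np.eye(s)]
    power = np.eye(F.shape[0])  # holds F^(j-1) at step j
    for _ in range(1, n_terms):
        coeffs.append(H @ power @ K)
        power = power @ F
    return coeffs

def hankel_rank(coeffs, blocks):
    """Numerical rank of the truncated block Hankel matrix [K(i+j-1)], i, j = 1..blocks.

    Requires len(coeffs) >= 2 * blocks so that K(2*blocks - 1) is available.
    """
    rows = [np.hstack([coeffs[i + j - 1] for j in range(1, blocks + 1)])
            for i in range(1, blocks + 1)]
    return np.linalg.matrix_rank(np.vstack(rows))

# Hypothetical scalar (s = 1) system of order n = 2.
F = np.diag([0.5, -0.3])
K = np.array([[1.0], [1.0]])
H = np.array([[1.0, 1.0]])
coeffs = impulse_response(F, K, H, 9)
order = hankel_rank(coeffs, 4)
```

In this example the Hankel rank recovers the order n = 2 of the state vector; with n infinite no finite truncation would stabilise in this way.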

This point of view has consequences, of course. It makes the asymptotic 
theory more difficult since, as is almost obvious, n will increase with T. Also 
the method of maximum likelihood (ML) fails since, clearly, this method will 
choose n indefinitely large. (However the method reappears as will be seen.) 
Thus some new criterion is needed. This problem has been considered by 
Akaike (1969) and Rissanen (1983). In the case of the latter the criterion 
introduced is that of a minimal description length of the data on the basis of 
a model. Via some approximations and on the basis of an assumption that 
the $\varepsilon(t)$ are Gaussian he arrives at the criterion

$$ \log\det\hat\Sigma_n + d(n)\log T/T, \tag{1.2} $$

where $d(n)$ is the number of system parameters fitted, i.e., parameters other
than those specifying $\Sigma$, and where $\hat\Sigma_n$ is the ML estimate of
$\Sigma$ when the order is $n$. Akaike, on the basis of prediction theory, suggested

$$ \log\det\hat\Sigma_n + 2d(n)/T. \tag{1.3} $$

More generally we could take

$$ X(n) = \log\det\hat\Sigma_n + d(n)C_T/T, \quad C_T/T \to 0, \quad C_T > 1, \tag{1.4} $$

for some prescribed sequence $C_T$. The conditions on $C_T$ are necessary for a
reasonable procedure. Of course $n$ is determined by minimising one of these
formulae, subject to $n \le N_T$, for $N_T$ yet to be prescribed.
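
To make the role of the penalty sequence concrete, here is a minimal sketch of order selection by a criterion of the form (1.4) in the scalar case, where the determinant is simply the residual variance. The function name and the residual variances are made up for illustration; only the form of the criterion comes from the text.

```python
import math

def select_order(sigma2_by_order, params_by_order, T, C_T):
    """Minimise log(sigma2_n) + d(n) * C_T / T over the candidate orders.

    sigma2_by_order[n] is the estimated innovation variance at order n and
    params_by_order[n] is d(n), the number of system parameters fitted.
    """
    values = [math.log(s2) + d * C_T / T
              for s2, d in zip(sigma2_by_order, params_by_order)]
    n_hat = min(range(len(values)), key=values.__getitem__)
    return n_hat, values

# Made-up residual variances for orders n = 0, 1, 2, 3 (illustrative only).
sigma2 = [2.0, 1.0, 0.96, 0.95]
d = [0, 1, 2, 3]
T = 100
aic_order, _ = select_order(sigma2, d, T, C_T=2.0)           # as in (1.3)
bic_order, _ = select_order(sigma2, d, T, C_T=math.log(T))   # as in (1.2)
```

With these numbers the choice C_T = 2 selects order 2 while C_T = log T selects order 1, which shows how a heavier penalty favours smaller models.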
In the next section some results will be presented relating to these methods
when the models considered are autoregressions. In Section 3 general
ARMA approximations will be discussed, albeit far from completely. In
Section 4 a brief discussion of problems will be given.

2. AUTOREGRESSIVE APPROXIMATION 
The model fitted is now of the form

$$ \sum_{j=0}^{h} \hat\Phi_h(j)y(t-j) = \hat\varepsilon_h(t), \qquad \hat\Phi_h(0) = I_s. \tag{2.1} $$

The $\hat\Phi_h(j)$ might be estimated from the Yule-Walker relations,

$$ \sum_{j=0}^{h} \hat\Phi_h(j)C(j-k) = \delta_{0k}\hat\Sigma_h, \quad k = 0, 1, \ldots, h, $$

$$ C(j) = \frac{1}{T}\sum_{t=1}^{T-j} y(t)y(t+j)', \quad j \ge 0; \qquad C(-j) = C(j)'. $$

Mean corrections would be made in practice. These Yule-Walker equations
can be solved by a recursion on $h$ (Whittle, 1963) that will be described
in Section 3. They can give badly biased estimates, even for quite large $T$
(Tjostheim and Paulsen, 1983). There are many modifications of them (see
Friedlander, 1982) and we briefly discuss this, again in Section 3. Putting
$\Gamma(j) = E\{y(t)y(t+j)'\}$ the $\hat\Phi_h(j)$ can be regarded as estimates of
$\Phi_h(j)$ where

$$ \sum_{j=0}^{h} \Phi_h(j)\Gamma(j-k) = \delta_{0k}\Sigma_h, \quad k = 0, 1, \ldots, h; \qquad \Phi_h(0) = I_s. $$

The quantity $d(n)$ is now $hs^2$, as is obvious. We shall have

$$ \sum_{j=0}^{h} \Phi_h(j)y(t-j) = \varepsilon_h(t), \qquad E\{\varepsilon_h(t)\varepsilon_h(t)'\} = \Sigma_h, $$

where $\varepsilon_h(t)$ is the error in predicting $y(t)$ optimally from
$y(t-1), \ldots, y(t-h)$. If $\varepsilon_h(t) = \varepsilon(t)$ then $n = hs$, but
that is not assumed.
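
As an illustration of the Yule-Walker relations in the scalar case (s = 1), the following sketch computes the sample autocovariances with divisor T and solves the h linear equations directly. The function names are hypothetical, and the direct solve stands in for the recursion on h mentioned in the text.

```python
import numpy as np

def sample_autocov(y, max_lag, demean=True):
    """C(j) = T^{-1} sum_{t=1}^{T-j} y(t) y(t+j), j = 0..max_lag (scalar series)."""
    y = np.asarray(y, dtype=float)
    if demean:
        y = y - y.mean()  # the mean correction made in practice
    T = len(y)
    return np.array([np.dot(y[:T - j], y[j:]) / T for j in range(max_lag + 1)])

def yule_walker(c, h):
    """Solve sum_{j=0}^{h} phi(j) C(j-k) = delta_{0k} sigma2 with phi(0) = 1."""
    c = np.asarray(c, dtype=float)
    if h == 0:
        return np.array([1.0]), float(c[0])
    lags = np.abs(np.subtract.outer(np.arange(h), np.arange(h)))
    R = c[lags]                       # Toeplitz matrix with entries C(|j - k|)
    phi_tail = np.linalg.solve(R, -c[1:h + 1])   # equations k = 1..h
    phi = np.concatenate(([1.0], phi_tail))
    sigma2 = float(np.dot(phi, c[:h + 1]))       # equation k = 0
    return phi, sigma2

c = sample_autocov([1.0, 1.0, -1.0, -1.0], max_lag=1)
phi, sigma2 = yule_walker(c, h=1)
```

For this toy series C(0) = 1 and C(1) = 1/4, so the fitted coefficient is -1/4 and the residual variance is 15/16.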

Ergodicity ensures that $C(j) - \Gamma(j) \to 0$, a.s. (almost surely). Hence
$\hat\Phi_h(j) \to \Phi_h(j)$, a.s., $j = 1, \ldots, h$, for each fixed $h$. Thus this
must hold also when $h$ increases sufficiently slowly. We wish to improve this
weak statement. Such an improvement must rest on an improvement of the result
concerning $C(j) - \Gamma(j)$ and for this purpose additional conditions must be
introduced. The best predictor of $y(t)$ from $y(s)$, $s < t$ is
$E\{y(t) \mid \mathcal{F}_{t-1}\}$ where $\mathcal{F}_t$ is the $\sigma$-algebra
determined by $y(s)$, $s \le t$. Since $\varepsilon(t) = y(t) - y(t \mid t-1)$,

$$ E\{\varepsilon(t) \mid \mathcal{F}_{t-1}\} = E\{y(t) \mid \mathcal{F}_{t-1}\} - y(t \mid t-1). $$

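Thus we impose

$$ E\{\varepsilon(t) \mid \mathcal{F}_{t-1}\} = 0, \qquad E\{\varepsilon(t)\varepsilon(t)' \mid \mathcal{F}_{-\infty}\} = \Sigma, $$

[third condition, on the components $\varepsilon_j(t)$, $j = 1, \ldots, s$, illegible in source] (2.2)

The first is reasonable for a theory of approximation of linear systems since if
$y(t \mid t-1)$ and $E\{y(t) \mid \mathcal{F}_{t-1}\}$ are very different the method
is inappropriate because the best predictor is far from the best linear predictor
and a linear model should not be used. The latter part of (2.2) is minor. However
for some purposes the second part has to be strengthened to

$$ E\{\varepsilon(t)\varepsilon(t)' \mid \mathcal{F}_{t-1}\} = \Sigma, \tag{2.3} $$

which is not so minor. For some purposes also we need the condition

[moment condition on $\varepsilon_j(t)$ involving $\log^+$, illegible in source] $< \infty$, $j = 1, \ldots, s$, (2.4)

where by $\log^+ x$ we mean $\log x$ for $x > 1$ and zero for $x \le 1$.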

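So far as $k(z)$ is concerned we require

$$ \det\{k(z)\} \neq 0, \quad |z| \ge 1, \qquad \sum_{j=0}^{\infty} \|K(j)\| < \infty, $$

when it follows that we may put

$$ k(z)^{-1} = \sum_{j=0}^{\infty} \Phi(j)z^{-j}, \qquad \sum_{j=0}^{\infty} \|\Phi(j)\| < \infty. \tag{2.5} $$

The following theorem then results. We use $O(\cdot)$ to indicate an order
relation that holds a.s. If the relation holds only in probability we write
$O_p(\cdot)$.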
Theorem 1. If $y(t)$ is generated by (1.1) and (2.2), (2.4) hold then, for
$H_T = o\{(T/\log T)^{1/2}\}$,

$$ \max_{1 \le j \le h} \|\hat\Phi_h(j) - \Phi_h(j)\| = O\{(\log T/T)^{1/2}\} + o(1)\sum_{j=1}^{\infty} \|\Phi(j)\|, \quad \text{a.s.}, $$

where the order relations hold a.s. and uniformly in $h \le H_T$. If (2.3) holds
then the second term on the right may be deleted and without (2.3) this is
true for $s = 1$ provided

$$ \limsup_{j \to \infty} j\|K(j)\| < \infty. $$

This result is proved by Hannan and Kavalieris (1984, 1986).

For $s = 1$ the result is very satisfactory, since the last condition is minor,
and the result seems the best possible. In interpreting the result we may use
the extension of a result due to Baxter (1967), namely, for $c < \infty$,

$$ \sum_{j=0}^{h} \|\Phi_h(j) - \Phi(j)\| \le c\sum_{j=h+1}^{\infty} \|\Phi(j)\|. \tag{2.6} $$

Using (2.6), under the conditions of the theorem,

$$ \Big\|\sum_{j=0}^{h} \hat\Phi_h(j)e^{ij\omega} - \sum_{j=0}^{\infty} \Phi(j)e^{ij\omega}\Big\| = O\{h(\log T/T)^{1/2}\} + O\Big(\sum_{j=h+1}^{\infty} \|\Phi(j)\|\Big), $$

which shows how the estimated frequency response function converges to the
true one.

The other things that an asymptotic theory should do are to determine
how $\hat h$, determined from (1.4), behaves and how accurate are the
$\hat\varepsilon_h(t)$, from (2.1), as estimates of the $\varepsilon(t)$. The latter
are covered by the following.
Theorem 2. If $y(t)$ is generated by (2.1) with (2.2), (2.3), (2.4), (2.5) holding
and

$$ c_T(\log\log T)^{[\ldots]} \le h \le H_T = o\{(T/\log T)^{1/2}\}, \quad c_T \uparrow \infty, $$

then

$$ [\text{left-hand side illegible in source}] = \{(sh/T)\Sigma + \Sigma_h - \Sigma\}\{1 + o(1)\} + O_p(h/T) $$

and if

$$ c(\log T)^{[\ldots]} \le h \le H_T = o\{(T/\log T)^{1/2}\}, \quad c > 0, $$

then the term $O_p(h/T)$ may be replaced by $o(h/T)$. (The exponents marked
$[\ldots]$ are illegible in the source.)

This theorem is proved by Hannan and Kavalieris (1986).

The evaluation shows how accurate the $\hat\varepsilon_h(t)$ are. The lower bound
imposed on $h$ in Theorem 2 is not onerous since, as the next theorem shows, even
for BIC and $y(t)$ generated by an ARMA process (i.e. $n$ is finite) then $\hat h$
increases as $\log T$. The next theorem deals with $\hat h$.

Theorem 3. Under the same conditions as for Theorem 2,

$$ \log\det\hat\Sigma_h + hs^2C_T/T = \log\det[\ldots] + hs^2(C_T - 1)/T + \mathrm{tr}\{\Sigma^{-1}(\Sigma_h - \Sigma)\}\{1 + o_p(1)\}, $$

where the argument of the first determinant on the right is illegible in the source.

This result is essentially due to Shibata (1980) for $s = 1$ and $\varepsilon(t)$
Gaussian. It shows that $\hat h/h^* \to 1$, in probability, where $h^*$ minimises
$hs^2(C_T - 1)/T + \mathrm{tr}\{\Sigma^{-1}(\Sigma_h - \Sigma)\}$. In case $C_T = 2$,
$s = 1$ this is $h/T + \sigma_h^2/\sigma^2 - 1$ and when the process is actually
ARMA then $h^* = \log T/(-2\log\rho_0)$ where $\rho_0$ is the modulus of a zero of
$k(z)$ nearest to $|z| = 1$. This result is also essentially due to Shibata (1980).
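
A small numerical sketch may help to show how slowly the approximating order grows under the ARMA formula h* = log T/(-2 log rho_0) quoted above. The function name is hypothetical and the values of T and rho_0 are chosen purely for illustration.

```python
import math

def approx_ar_order(T, rho0):
    """Evaluate h* = log T / (-2 log rho0) for a zero modulus 0 < rho0 < 1."""
    if not 0.0 < rho0 < 1.0:
        raise ValueError("rho0 must lie strictly between 0 and 1")
    return math.log(T) / (-2.0 * math.log(rho0))

h_far = approx_ar_order(1000, 0.5)   # zero well inside the unit circle
h_near = approx_ar_order(1000, 0.9)  # zero close to the unit circle
```

For T = 1000 a zero of modulus 0.5 gives h* of about 5, while a zero of modulus 0.9 gives h* of about 33, so zeros near the unit circle call for much longer autoregressions.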

This concludes our discussion of the asymptotic theory of autoregressive
approximation. No form of central limit theorem has been given. Such a
theorem, say about the $\hat\Phi_h(j)$, to be useful would have to be uniform in
$h$, asserting, for example, that any linear combination of the elements
of $\hat\Phi_h(j) - \Phi_h(j)$ satisfies the central limit theorem, for the length
of the vector of coefficients uniformly bounded, and $h$ bounded by some function
of $T$ increasing sufficiently slowly. Such a theorem can be established, but
we shall not discuss it here.

3. APPROXIMATION BY RATIONAL FUNCTIONS 

This section will be concerned with algorithms and how they might be
constructed and not with theory because there is little theory available at
the moment, except for the case where $n_0$, the true order, is finite. (See,
however, Taniguchi, 1980.) In order to confine the account within reasonable
bounds let us return to the Hankel matrix $\mathcal{H}$ and make some observations.
Let $M(n)$ be the set of all $\mathcal{H}$, i.e. of all $k(z)$, for which
$\mathcal{H}$ is of rank $n$ (and for which $k(z)$ is analytic and has
$\det(k) \neq 0$, for $|z| > 1$). Then, as is well known, $M(n)$ is an analytic
manifold. An open dense set in $M(n)$ is $U(n)$ where $U(n)$ consists of those
$\mathcal{H}$ for which $H_0$ can be taken to be comprised by the first $n$ rows
of $\mathcal{H}$. The set of all $k(z)$ in $U(n)$ can be represented in ARMA
form as $k(z) = a(z)^{-1}b(z)$, $a(z) = \sum_{j=0}^{p+1} A(j)z^{-j}$,
$b(z) = \sum_{j=0}^{p+1} B(j)z^{-j}$ where, if

$$ n = ps + q, \quad 0 \le q < s, $$

[the partitioned forms prescribed for $A(0) = B(0)$ and for $A(p+1)$, $B(p+1)$ are illegible in source]

$$ A(j) = B(j) = 0, \quad j > p+1. \tag{3.1} $$

Here all partitions are after the $q$th row or column and $*$ entries indicate
freely varying submatrices. The set of all freely varying coefficients in $A(j)$,
$B(j)$ will be called $\tau$. All $A(j)$, $B(j)$ not listed in (3.1) are freely varying.
Since $U(n)$ is open and dense in $M(n)$ it is not unreasonable to confine
one's attention to this set. This deserves discussion of course and will not
be universally agreed with, but to give a fuller account would require us
to introduce such concepts as Kronecker indices (see Kailath, 1980) as well
as other coordinate neighbourhoods in $M(n)$ and this could not possibly be
done within the scope of this paper. It must be remembered that here we
are approximating to the true Hankel matrix, and are not asserting that
$n_0 < \infty$. In such an approximation procedure the choice of the systems, for
$n < \infty$, as approximants is already arbitrary. They are chosen, partly, for
mathematical convenience. In such a context it is not unreasonable to further
confine ourselves. The algorithms here outlined could be used in the more
general context where, for example, Kronecker indices are to be determined
but we cannot discuss that though it could be a preferable procedure.

In this context, where $n$ as well as the parameter vector $\tau$ has to be
determined, the computational task becomes great. One procedure is to use the
Gaussian likelihood and to optimise that for each $n$ to be examined, choosing
$n$ by (1.4). The likelihood can be constructed using the Kalman filter apparatus
for both the likelihood and its derivatives. Such a procedure must be iterative.
An alternative is to choose $n$ at each iteration, i.e. to make each iteration a
step in a Gauss-Newton procedure that views $n$ as a parameter to be determined
along with $\tau$. This can reduce the calculations if the variation in $n$ at each
iteration can be handled by a recursion on $n$. Consider the approximation
to $-2T^{-1}$ (log likelihood) afforded by

$$ \frac{1}{2\pi}\int_{-\pi}^{\pi} \mathrm{tr}\{\Sigma^{-1}k(e^{i\omega})^{-1}I(\omega)k^*(e^{i\omega})^{-1}\}\,d\omega + \log\det\Sigma, $$

$$ I(\omega) = w(\omega)w(\omega)^*, \qquad w(\omega) = T^{-1/2}\sum_{t=1}^{T} y(t)e^{it\omega}. \tag{3.2} $$

Minimising this with respect to $\Sigma$ we obtain

$$ \log\det\Big\{\frac{1}{2\pi}\int_{-\pi}^{\pi} k(e^{i\omega})^{-1}I(\omega)k^*(e^{i\omega})^{-1}\,d\omega\Big\} + s. \tag{3.3} $$
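Since $\hat k$ is an already available estimate of $k$, let

$$ k^{-1} = \hat k^{-1} + \partial_\tau\hat k^{-1}(\tau - \hat\tau), \tag{3.4} $$

where $\partial_\tau\hat k^{-1}(\tau - \hat\tau) = \sum_i \partial_i\hat k^{-1}(\tau_i - \hat\tau_i)$,
$\tau_i$ is the $i$th component of $\tau$ and $\partial_i\hat k^{-1}$ indicates
differentiation followed by evaluation at $\hat k$. Let us use $\hat k^{-1}y(t)$,
for example, to mean that $\hat k^{-1}$ is interpreted as a lag operator
with $z^{-1}y(t) = y(t-1)$. Then

$$ (\hat k^{-1} - \partial_\tau\hat k^{-1}\hat\tau)y(t) = \hat\eta(t) - \hat\xi(t) + \hat\varepsilon(t) $$

and, recalling that $\hat k = \hat a^{-1}\hat b$,

$$ \hat\varepsilon(t) = \hat k^{-1}y(t), \quad \hat\eta(t) = \hat b^{-1}y(t), \quad \hat\xi(t) = \hat b^{-1}\hat\varepsilon(t). \tag{3.5} $$

These may be calculated recursively using the Toeplitz assumption inherent
in (3.2), namely $y(t) = 0$, $t \le 0$. Also

$$ \partial_\tau\hat k^{-1}\tau\,y(t) = \hat b^{-1}(a_\tau - I_s)y(t) - \hat b^{-1}(b_\tau - I_s)\hat\varepsilon(t). $$

Thus (3.3) becomes after the use of (3.4):

$$ \log\det\Big\{\frac{1}{2\pi}\int_{-\pi}^{\pi} w_e(\omega)w_e(\omega)^*\,d\omega\Big\} + s, $$

where

$$ w_e(\omega) = T^{-1/2}\sum_{t=1}^{T} e(t)e^{it\omega} $$

and

$$ e(t) = \hat\eta(t) - \hat\xi(t) + \hat\varepsilon(t) + \hat b^{-1}\{a_\tau - A_\tau(0)\}y(t) - \hat b^{-1}\{b_\tau - A_\tau(0)\}\hat\varepsilon(t) + \hat b^{-1}(A_\tau(0) - I_s)\{y(t) - \hat\varepsilon(t)\}. $$

(See (3.1) for $A_\tau(0)$.) Thus we are reduced to a regression procedure. The
calculation may be effected by a sequence of steps that we briefly describe
before going on to detail the calculations for $s = 1$.

Let us indicate iterative stages by a superscript.

(0) Take $\hat k^{(0)}$ to be the estimate of $k$ obtained from the autoregressive
procedure described in Section 2. Thus

$$ \hat k^{(0)}(z)^{-1} = \sum_{j=0}^{\hat h} \hat\Phi_{\hat h}(j)z^{-j}. $$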
(1) Next consider $n = sp$ and investigate $p = 0, 1, 2, \ldots$. For this sequence
of cases the calculation may be done by a recursion on $p$. The calculation
becomes onerous as $s$ increases since it involves a vector $v(t)$ (see below for
$s = 1$) that has $2s^2$ components. (The coefficients of the $\tau_i$ in
$\hat b^{-1}(a_\tau - I_s)y(t)$, $\hat b^{-1}(b_\tau - I_s)\hat\varepsilon(t)$. Now
$A_\tau(0) = I_s$, for $n = ps$.) This "curse of dimensionality" plagues all
estimation once $s$ is large. However the calculation at the first iteration,
i.e. to determine $\hat p^{(1)}$, is much simpler and does not suffer from
these difficulties, because in $\hat k^{(0)}$, $\hat b = I_s$, so that the recursive
procedure at the first iteration involves only $2s$ components in the vector $v(t)$.
Thus one procedure would be to determine $\hat p^{(1)}$ and examine, at later
iterations, only a few values of $p$ near $\hat p^{(1)}$. For details see Hannan and
Kavalieris (1984).

(2) Once $\hat p$ is chosen it may be sufficient to examine $n = ps + q$,
$0 \le q < s$, at each iteration, only for a few values of $p$ near $\hat p$.

(3) The value of $C_T$ in (1.4) must be chosen. The common choices are
$C_T = 2$, $C_T = \log T$ and there is evidence that $C_T = 2$ has virtues (Shibata,
1980; see however Lütkepohl, 1985).

(4) One must not examine values of $p$ that are too large in relation to $T$ and
certainly they must be $o(T^{1/2})$.
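
For s = 1 the first iteration described in steps (0) and (1) amounts to a regression of y(t) on lagged values of y and on lagged residuals from a long autoregression. The sketch below is a hedged illustration of that idea and not the author's algorithm: it uses ordinary least squares in place of the recursion on p, all function names are hypothetical, and the data are simulated from an assumed ARMA(1,1) model. The sign convention follows a(z)y = b(z)e with leading coefficients equal to one.

```python
import numpy as np

def long_ar_residuals(y, h):
    """Stage (0): least-squares AR(h) fit; residuals are set to zero for t < h."""
    T = len(y)
    X = np.column_stack([y[h - j:T - j] for j in range(1, h + 1)])
    coef, *_ = np.linalg.lstsq(X, y[h:], rcond=None)
    eps = np.zeros(T)
    eps[h:] = y[h:] - X @ coef
    return eps

def arma_regression(y, eps, p, start):
    """Regress y(t) on -y(t-j) and eps(t-j), j = 1..p, using t >= start.

    Returns (alpha, beta, sigma2); start - p must be at least the AR lag h.
    """
    T = len(y)
    target = y[start:]
    if p == 0:
        return np.zeros(0), np.zeros(0), float(np.mean(target ** 2))
    lagged_y = [-y[start - j:T - j] for j in range(1, p + 1)]
    lagged_e = [eps[start - j:T - j] for j in range(1, p + 1)]
    X = np.column_stack(lagged_y + lagged_e)
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    sigma2 = float(np.mean((target - X @ coef) ** 2))
    return coef[:p], coef[p:], sigma2

def choose_p(y, h, p_max, C_T):
    """Pick p minimising log(sigma2_p) + 2 p C_T / T over p = 0..p_max."""
    T = len(y)
    eps = long_ar_residuals(y, h)
    start = h + p_max  # common estimation sample for every p
    crit = [np.log(arma_regression(y, eps, p, start)[2]) + 2 * p * C_T / T
            for p in range(p_max + 1)]
    return int(np.argmin(crit)), eps

def simulate_arma11(T, alpha, beta, seed=0, burn=200):
    """Simulate y(t) + alpha y(t-1) = e(t) + beta e(t-1) with standard normal e."""
    rng = np.random.default_rng(seed)
    e = rng.standard_normal(T + burn)
    y = np.zeros(T + burn)
    for t in range(1, T + burn):
        y[t] = -alpha * y[t - 1] + e[t] + beta * e[t - 1]
    return y[burn:]
```

A typical call would be `p_hat, eps = choose_p(y, h=10, p_max=4, C_T=np.log(len(y)))` followed by `arma_regression(y, eps, p_hat, start=14)`. Later iterations in the text replace y and the residuals by filtered versions, which this sketch does not attempt.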

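To conclude this section we give more detail for $s = 1$, which is an
important case. Now $q = 0$ and since there is no dimensionality problem the
full recursion may as well be done at each iteration. We recall the algorithm
of Whittle (1963) which we use with a vector of observations that we call
$v(t)$, of $u$ components. Put

$$ C_v(j) = \frac{1}{T}\sum_{t=1}^{T-j} v(t)v(t+j)' = C_v(-j)'. $$

Then the algorithm computes as follows:

$$ F_h(j) = F_{h-1}(j) + F_h(h)\tilde F_{h-1}(h-j), $$
$$ \tilde F_h(j) = \tilde F_{h-1}(j) + \tilde F_h(h)F_{h-1}(h-j), $$
$$ \tilde F_h(0) = F_h(0) = I_u, $$
$$ F_h(h) = -\Big\{\sum_{j=0}^{h-1} F_{h-1}(j)C_v(j-h)\Big\}\tilde S_{h-1}^{-1}, $$
$$ \tilde F_h(h) = -\Big\{\sum_{j=0}^{h-1} \tilde F_{h-1}(j)C_v(h-j)\Big\}S_{h-1}^{-1}, $$
$$ S_h = [I_u - F_h(h)\tilde F_h(h)]S_{h-1}, $$
$$ \tilde S_h = [I_u - \tilde F_h(h)F_h(h)]\tilde S_{h-1}, $$
$$ S_0 = \tilde S_0 = C_v(0). $$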
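
When the observations are scalar the forward and backward quantities in such an order recursion coincide and it reduces to the familiar Levinson-Durbin recursion. The following sketch shows that scalar special case only; the function name is hypothetical, and the convention assumed is that the coefficients phi_h(j) satisfy sum_j phi_h(j) C(j-k) = 0 for k = 1..h with phi_h(0) = 1.

```python
import numpy as np

def levinson_durbin(c, h_max):
    """Order-recursive solution of the scalar Yule-Walker equations.

    c[0..h_max] are autocovariances. Returns a list whose h-th entry is
    (phi_h, sigma2_h) with phi_h = [1, phi_h(1), ..., phi_h(h)].
    """
    c = np.asarray(c, dtype=float)
    phi = np.array([1.0])
    sigma2 = float(c[0])
    fits = [(phi.copy(), sigma2)]
    for h in range(1, h_max + 1):
        # delta = sum_{j=0}^{h-1} phi_{h-1}(j) C(h - j)
        delta = float(np.dot(phi, c[h:0:-1]))
        kappa = -delta / sigma2            # the new last coefficient phi_h(h)
        padded = np.append(phi, 0.0)
        phi = padded + kappa * padded[::-1]
        sigma2 *= 1.0 - kappa ** 2         # variance update, scalar analogue of S_h
        fits.append((phi.copy(), sigma2))
    return fits

c_example = np.array([1.0, 0.25, -0.3])
fits = levinson_durbin(c_example, 2)
```

Each pass costs a number of operations proportional to h, so all orders up to h_max are obtained in roughly h_max squared operations, which is what makes examining many candidate orders cheap.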

At the autoregressive stage, (0), $v(t) = y(t)$ and $u = s$. At later stages
$u = 2s^2$ and for $s = 1$, in particular, $v^{(i)}(t)$ is formed from the pair
$\hat\eta^{(i)}(t)$, $\hat\xi^{(i)}(t)$, where $i = 0, 1, 2, \ldots$ indexes the
iterations. For $i = 0$ then $\hat\eta^{(0)}(t) = y(t)$,
$\hat\xi^{(0)}(t) = \hat\varepsilon^{(0)}(t)$, and so on. We now identify $h$ with
$n$, or equivalently with $p$, since $n = ps$ and $s = 1$. Now calling $\alpha(j)$,
$\beta(j)$, $j = 1, \ldots, p$, the coefficients in $a(z)$, $b(z)$ we
have the recursion, using a superscript to indicate the iterative stage and a
subscript for the recursive stage:

[Display illegible in source: the recursion on $p$ for the coefficient pairs
$(\hat\alpha_p^{(i)}(j), \hat\beta_p^{(i)}(j))$ and the residual variance
$\hat\sigma_p^{2(i)}$.]

Of course $\hat p^{(i)}$ is chosen to minimise

$$ \log\hat\sigma_p^{2(i)} + 2pC_T/T, \quad p = 0, 1, \ldots. $$

The summations are over those $t$ for which $\hat\eta(t)$, $\hat\xi(t)$ etc. are
non-null and these ranges are determined by the rule $y(t) = 0$, $t \le 0$, $t > T$.
When this is done we are acting consistently with the minimisation of (3.2).
However it is known that this "Toeplitz" convention leads to biases. There
is a large literature that tries to mitigate its effects. See Friedlander (1982)
for references.
4. SOME FURTHER COMMENTS

Except when $s = 1$ these procedures, or indeed any procedures for fitting
ARMA models, do not seem to have been widely used except in cases where
there are fairly considerable fixed constraints based on physical understanding.

In the ad hoc fitting procedures here considered the procedure may
constitute only a stage towards a final result, such as spectral estimation.
Though the consideration of all rational transfer function systems has appeal
it is also computationally costly and the cost may become considerable if the
program is to run when the best fitting structure is near to unstable, e.g. if
$\det b_0(z)$ is near to zero at $|z| = 1$ or $a_0(z)$, $b_0(z)$ are near to
non-coprime. (For example, the possibility that $b_0(z)$ is zero for $|z| < 1$ will
have to be taken care of in (3.5). For $s = 1$ this is easily done.)

It is conceivable that consideration of something less than the full set of 
rational transfer function models will be better in some circumstances. In 
any case one is concerned with an approximation to the truth and the choice 
of the model set from which to obtain the approximation is in the hands 
of the investigator and is a reflection of his skill. The use of autoregressive 
approximation is a manifestation of these considerations. 

Much remains to be done. 
SOME REFLECTIONS ON
THE MODELLING OF TIME SERIES 

ABSTRACT 

To respond properly to the increasing demand for efficient data process- 
ing procedures in diverse areas of application, emphasis must be placed on 
the advancement of time series modelling. The progress of the art of time 
series modelling, or of statistical modelling in general, may be accelerated by 
explicit recognition of the fact that the subject is essentially concerned with 
the proper use of false models. In this paper the implication of this point 
of view is illustrated by examples, including Bayesian models for seasonal 
adjustment and for estimation of changing spectrum, and non-Gaussian au- 
toregressive models for robust analysis of a system with sporadic impulsive 
disturbances. 



1. INTRODUCTION 

With the progress of data acquisition techniques and computing facilities 
the demand for efficient data processing procedures is increasing rapidly. 
Proper response to this societal need is vital for the future development of 
time series analysis. 

In time series literature, the main emphasis has been placed on the devel- 
opment of exact mathematical analysis of statistical procedures under simpli- 
fying assumptions such as Gaussianity and stationarity. These assumptions 
are usually accepted as being satisfied by a time series. This common atti- 
tude on the part of time series analysts is not precisely in agreement with 
the usual understanding that a statistical model is an approximation to the 
structure of a stochastic system that generated the observed data. 

¹ The Institute of Statistical Mathematics, 4-6-7 Minami-Azabu Minato-Ku,
Tokyo 106, Japan 
In this paper we stress a particular point of view which considers that the
role of a statistical model is to provide proper incentive for the generation 
of necessary algorithms for efficient data processing. This means that it 
is not necessarily the similarity of the model to the unknown generating 
mechanism of a time series but rather the practical utility of the resulting 
data processing procedure that justifies the use of a time series model. We 
will call this the instrumental point of view of statistical modelling. 

Once this point of view is accepted, statisticians can freely develop new 
time series models based on their perceptions of required data processing 
procedures. The conventional structural point of view that considers a sta- 
tistical model to be the representation of a real stocha^stic structure has 
severely limited the range of the development of time series models. 

The situation is similar to the case of the development of general
Bayesian modelling. The idea that a model should be the representation 
of a unique structure for each particular application has severely limited the 
possibility of developing practical, useful Bayesian models. From our present 
point of view a statistical model is only a representation of the framework 
through which we look at a given set of data. Thus the introduction of a 
model is justified only if it leads to the development of a useful data process- 
ing procedure. The practical utility can be judged through the accumulation 
of experiences of the application to real problems. 

In the present paper the potential of the instrumental point of view is 
demonstrated by some examples of time series modelling. The first one is 
concerned with Bayesian modelling for seasonal adjustment. This example 
clarifies the unjustified nature of the usual expectation that the selection of 
a proper model from a set of models will lead to a satisfactory result. Such 
an expectation can be justified only when the set contains models that show 
sufficiently good fit to the data. This is illustrated by the comparison of 
predictive performances and likelihoods of some models. 

In the second example an extended form of Bayesian modelling for the 
analysis of a quickly changing spectrum is discussed. In this model observed 
values are allowed to come into the definition of the “prior” distribution. 
The performance of the procedure based on this model shows that the model
provides a reasonable basis for the analysis of a system with quickly changing 
characteristics. This shows that even a formal extension of the Bayesian 
model to allow observed variables to enter into the “prior” distribution may 
produce useful algorithms for data analysis. 

The third example is concerned with the modelling of the autoregressive 
(AR) process to control the effect of sporadic large impulsive inputs on the 
estimation of the structure of the autoregression. After discussion of the use 
of a mixed Gaussian distribution for the residual series, the use of a Cauchy 
distribution is considered. The analysis shows that, in maximum likelihood 
computations, the Cauchy model discounts the effect of residuals with large
absolute values on the estimation of autoregressive coefficients, at least as
compared with the Gaussian model. A numerical example shows the bias 
and robustifying effect of the model when the true structure is Gaussian. 



2. PREDICTIVE EVALUATION OF TIME SERIES MODELS 

The seasonal adjustment procedure BAYSEA (Akaike, 1980a) is based
on the ordinary representation

$$y(n) = T(n) + S(n) + I(n),$$

where y(n), T(n), and S(n) denote the original observation, trend, and
seasonal component, respectively. The data distribution is defined by assuming
the i.i.d. type Gaussian structure for the distribution of the irregular
component I(n). The prior distribution of the trend and seasonal components is
defined by assuming that a Gaussian distribution of these components con- 
trols the smoothness of their behavior. The posterior mode, which defines 
the final estimates of the trend and seasonal components, is obtained by the 
solution of 

$$\min \; \|y - T - S\|^2 + r\,\|FT + GS\|^2,$$

where y, T, and S denote the vectors of y(n), T(n), and S(n) (n =
1, 2, ..., N); $\|\cdot\|$ denotes the Euclidean norm; and F and G denote
properly defined constant matrices. The positive parameter r is determined by
maximizing the likelihood of the Bayesian model.
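
The minimization above is an ordinary least squares problem once the penalty rows are stacked under the data rows. The following is a minimal sketch, not the BAYSEA implementation: the paper does not specify F and G, so the choices here (second-order differences of the trend, seasonal differences, and running sums over one period of the seasonal component) are assumptions made only for illustration, as are the function names. The value of r is taken as given; its choice by maximizing the likelihood of the Bayesian model is not shown.

```python
import numpy as np

def difference_matrix(n, lag=1):
    """Rows compute x[i] - x[i - lag] for i = lag, ..., n - 1."""
    d = np.zeros((n - lag, n))
    for row in range(n - lag):
        d[row, row + lag] = 1.0
        d[row, row] = -1.0
    return d

def decompose(y, period=12, r=1.0):
    """Minimise ||y - T - S||^2 + r * ||F T + G S||^2 by stacked least squares.

    F and G below are illustrative assumptions, not the matrices used in BAYSEA.
    """
    n = len(y)
    d2 = difference_matrix(n - 1) @ difference_matrix(n)   # 2nd differences of T
    seas = difference_matrix(n, lag=period)                # S(n) - S(n - period)
    # Sums of `period` consecutive seasonal values, which should stay near zero.
    run = np.array([[1.0 if k <= j < k + period else 0.0 for j in range(n)]
                    for k in range(n - period + 1)])
    # F acts on the trend only and G on the seasonal only, so that
    # ||F T + G S||^2 = ||d2 T||^2 + ||seas S||^2 + ||run S||^2.
    F = np.vstack([d2, np.zeros((seas.shape[0] + run.shape[0], n))])
    G = np.vstack([np.zeros((d2.shape[0], n)), seas, run])
    design = np.vstack([np.hstack([np.eye(n), np.eye(n)]),
                        np.sqrt(r) * np.hstack([F, G])])
    target = np.concatenate([np.asarray(y, dtype=float), np.zeros(F.shape[0])])
    solution, *_ = np.linalg.lstsq(design, target, rcond=None)
    return solution[:n], solution[n:]                      # (trend, seasonal)
```

With these assumed penalties, a noise-free series made of a linear trend plus a zero-sum periodic pattern is reproduced exactly, because every penalty row vanishes for it. Larger values of r give smoother estimates when noise is present.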

The use of this type of constrained least squares procedure for the
solution of an ill-posed problem that defies direct application of the ordinary
least squares method is well known. However, the performance of the
procedure is controlled by the choice of the parameter r; the application of
Bayesian modelling to the choice of this parameter was first discussed by
Akaike (1980b) to illustrate the instrumental use of a Bayesian model. This
example shows typically that a prior distribution can be considered as
an artificial construction to produce a reasonable data processing procedure
rather than the representation of the “state of the mind of the analyst”,
which is rather difficult to specify.

In the case of the original BAYSEA procedure, a homogeneous
spherical Gaussian distribution was assumed for the differences, of some
order, of the trend. This model is obviously unsatisfactory for the purpose of
prediction of the trend, as it produces only a very limited form of prediction:
for example, flat and linear trends for models based on first and second order
differences, respectively. Figure 1 shows examples of trend prediction with
corresponding bands of twice the posterior standard deviations.

Figure 1. Examples of the trend prediction by the basic models of BAYSEA.

The number denoted by ABIC means minus twice the logarithm of the
likelihood of the Bayesian model defined by the integral of the likelihood of
each Gaussian model with respect to the prior distribution, and represents
the badness-of-fit of each model to the observed data. The first order
difference based model shown in Figure 1A produces the narrowest band of
posterior standard deviations but shows poor fit to data with a linear trend.
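
In symbols, the verbal definition of ABIC given here can be written as below. The notation is an assumption introduced only for this display: θ stands for the stacked trend and seasonal components, f(y | θ) for the Gaussian data distribution, and π(θ) for the prior distribution.

```latex
\mathrm{ABIC} = -2 \log \int f(y \mid \theta)\, \pi(\theta)\, d\theta
```

A smaller ABIC therefore corresponds to a larger likelihood of the Bayesian model, that is, to a better fit.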

One natural idea is to modify the prior distribution of the trend compo- 
nent by considering an incomplete differencing, i.e., by assuming a first order 
AR model for the trend after proper differencing. First order AR models for 
first and second order differences of the trend provide interpolating models 
of the three models of Figure 1. Some examples of the results obtained by 
these models are illustrated in Figure 2, with observations for the last year 
added to check for goodness-of-prediction. 





Figure 2. Examples of the trend prediction by the AR modelling. 

The result shows that, although there are slight increases in the ABIC,
both models produce twelve months ahead predictions that are in better
agreement with the actual trend of observations than that obtained by the
model of Figure 1B. This suggests that the lower value of the ABIC of the
latter is due mainly to the clearly linear trend in the preceding years and
thus that the model may not be useful when change in the trend is possible.



3. POSTERIOR DISTRIBUTION AS THE 
CONDITIONAL LIKELIHOOD 

The example discussed in the preceding section shows the necessity of 
care in developing a Bayesian model for forecasting. In particular, it has 
shown that the posterior distribution must be carefully checked by compar- 
ing it with those of other models representing different possibilities, with 
proper attention being paid to the variation of likelihoods within the mod- 
els. This means that there are situations where one should consider the 
posterior distribution of a parameter simply as the representation of the rel- 
ative likelihood of each possible parameter value under the assumption of 
the Bayesian model. 

In Bayesian modelling the posterior distribution satisfies the relation

$$p(X, A) = p(A \mid X)\, p(X),$$

where p(X, A) denotes the simultaneous probability (density) of (X, A)
before observing X, p(A | X) denotes the posterior probability (density) of
A, and p(X) represents the likelihood of the Bayesian model defined by the
integral of p(X, A) with respect to the measure of the parameter A. The
distribution p(X, A) provides the basic framework through which one looks
at the data, and the present observation shows that, during the process of
model building, one should keep in mind the relative nature of the
posterior distribution. If the situation demands, one should check the necessity
of modifying p(X, A) by considering the likelihoods of possible alternative
models.
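
The relation between the simultaneous distribution, the posterior, and the likelihood of the Bayesian model can be illustrated numerically. The sketch below is an assumed toy setting, not one taken from the paper: Gaussian observations with unknown mean A, a zero-mean Gaussian prior on A, and integration by a simple sum over a grid. All function names are hypothetical.

```python
import math

def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def bayes_on_grid(data, grid, prior_sd, noise_sd=1.0):
    """Return (posterior density on the grid, likelihood p(X) of the Bayesian model)."""
    step = grid[1] - grid[0]
    joint = []
    for a in grid:
        likelihood = 1.0
        for x in data:
            likelihood *= normal_pdf(x, a, noise_sd)
        joint.append(likelihood * normal_pdf(a, 0.0, prior_sd))  # p(X, A)
    evidence = sum(joint) * step                  # p(X), integral of p(X, A) over A
    posterior = [j / evidence for j in joint]     # p(A | X) = p(X, A) / p(X)
    return posterior, evidence
```

Two priors can give equally well-normalized posteriors while the likelihoods p(X) of the two Bayesian models differ by orders of magnitude, which is why the posterior alone is only a relative statement within one model.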



4. ESTIMATION OF QUICKLY CHANGING AUTOREGRESSION 

At the 1979 meeting of the International Statistical Institute the au- 
thor presented a paper on the construction of composite time series mod- 
els (Akaike, 1979). In that paper a procedure for the estimation of a 
quickly changing spectrum was introduced which was realized by minimizing 
L = SSR + Q, where SSR and Q were defined by

$$\mathrm{SSR} = \sum_{p=1}^{P} \sum_{i=(p-1)K+1}^{pK} \Bigl[ y(i) - \sum_{m=1}^{M} a(m:p)\, y(i-m) - a(0:p) \Bigr]^2$$

and

$$\begin{aligned}
Q ={}& c \sum_{i=1}^{K} \Bigl[ \sum_{m=1}^{M} a(m:1)\, y(i-m) \Bigr]^2 \\
&+ a \sum_{p=2}^{P} \sum_{i=(p-1)K+1}^{pK} \Bigl[ \sum_{m=1}^{M} a(m:p)\,\bigl( y(i-m) - y(i-m-K) \bigr) \Bigr]^2 \\
&+ b \sum_{p=1}^{P} \sum_{i=(p-1)K+1}^{pK} \Bigl[ \sum_{m=1}^{M} m\, a(m:p)\, y(i-m) \Bigr]^2 \\
&+ dPK\, a(0:1)^2 + ePK \sum_{p=2}^{P} \bigl[ a(0:p) - a(0:p-1) \bigr]^2 ,
\end{aligned}$$

where a(m : p) denotes the mth coefficient of autoregression for the pth span
of data with length K (p = 1, 2, ..., P).

The procedure requires the choice of positive constants a, b, c, d, and
e. The constants a and b control the smoothness of predicted values in time
and frequency domains, respectively, c controls the sizes of the predicted 
values for the initial span, and d and e control the size of the initial value 
and the smoothness of the constant term of the regression, respectively. 

The structure of the procedure is quite similar to that of BAYSEA but 
the term Q that defines the prior distribution depends on the observations. 
This dependence on the observed values gives the impression that the mod- 
elling is quite un-Bayesian. 

However, a Bayesian model may be viewed simply as a specification of a 
simultaneous distribution of the observation and the parameters. Thus, if by 
a proper normalization, exp(-gL) (g > 0) defines a probability distribution
over the set of possible observations and coefficients of autoregression, its 
integral with respect to the latter will define the likelihood of the model and 
the corresponding conditional distribution of the autoregressive coefficients 
will give the posterior distribution. 
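
Written out, the construction described in this paragraph reads as follows, assuming the normalizing integral is finite. Here y stands for the observations and a for the collection of autoregressive coefficients; the primed integration variables are notation introduced only for this display.

```latex
p(y, a) = \frac{\exp\{-g\,L(y, a)\}}{\iint \exp\{-g\,L(y', a')\}\, dy'\, da'},
\qquad
p(y) = \int p(y, a)\, da,
\qquad
p(a \mid y) = \frac{p(y, a)}{p(y)} .
```

Nothing in this construction requires the factor playing the role of the prior to be free of y, which is the sense in which the extension is formal.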

Based on this observation an experimental computer program called
LOCCAR (locally constant AR-model fitting) was developed by slightly
modifying the definition of Q in the equation L = SSR + Q to make the
procedure operative even for the case with K = 1 (Akaike et al., 1985). The
term Q is then given by:

$$\begin{aligned}
Q ={}& \sum_{p=1}^{H+1} \Bigl\{ c \Bigl[ a(0:p) + \sum_{m=1}^{M} a(m:p)\, y(M + (p-1)K - m) \Bigr]^2 \\
&\qquad + a \sum_{i=1}^{M} \Bigl[ a(0:p+1) - a(0:p) + \sum_{m=1}^{M} \bigl( a(m:p+1) - a(m:p) \bigr)\, y(M + (p-1)K - m + i) \Bigr]^2 \Bigr\} \\
&+ a \sum_{p=2}^{P} \sum_{i=pK-J}^{pK} \delta(i) \Bigl[ a(0:p) - a(0:p-1) + \sum_{m=1}^{M} \bigl( a(m:p) - a(m:p-1) \bigr)\, y(i-m) \Bigr]^2 \\
&+ b \sum_{p=1}^{P} \Bigl[ \sum_{m=1}^{M} m\, a(m:p) \Bigr]^2 + dPK\, a(0:1)^2 + ePK \sum_{p=2}^{P} \bigl[ a(0:p) - a(0:p-1) \bigr]^2 ,
\end{aligned}$$

where H denotes the integer part of (M - 1)/K, δ(i) = 1 for i > K and
0 otherwise, and J denotes max(K, M).

Figure 3 shows the EEG record of some brain waves and also the spectra
obtained by applying LOCCAR with M = 5 and K = 4. It can be seen that
the procedure responds to the change of the spectrum at an early stage of 
the recording. The unspecified parameters were adjusted by maximizing the 
likelihood of the model within a finite set of alternatives. Results with some 
simulated data demonstrated satisfactory performance of the procedure. 

The distribution assumed in this model may be viewed as a represen- 
tation of a Bayesian structure with a prior distribution that adapts to past 
experience. Thus it is not free from the serious problem of the choice of the 
speed of adaptation. Numerical results show that the choice of this speed is 
realized by the maximization of the likelihood of the model, at least for the 
purpose of the analysis of a past record.

Figure 3. Estimation of a changing spectrum by the locally constant AR
modelling.



5. AUTOREGRESSION WITH CAUCHY DISTRIBUTED
ERROR TERMS 

One of the most popular time series models is the Gaussian autoregres- 
sive model. This model is defined by 

$$y(i) = \sum_{m=1}^{M} a(m)\, y(i-m) + z(i),$$

where the error term z(i) is assumed to be independent of the past values
y(i - m) and z(i - m), m = 1, 2, ..., and identically distributed as Gaussian
with mean 0 and variance v. The popularity of this model is, to a large
extent, due to the simplicity of the concomitant estimation procedure, which 
is usually realized by the method of least squares. 

However, there are certain situations where the assumption of Gaussian- 
ity is not appropriate. One such example is the case where the basic process 
is disturbed by some sporadic input. To model this situation precisely it is 
necessary to specify the distributional property of the input process. Here 
we consider the use of a heavy-tailed distribution for z(i), to compensate for
the effect of the sporadic input. 

First we consider the use of a mixture of the original Gaussian distribu- 
tion and another Gaussian distribution with mean 0 and variance w much 
larger than v. In particular, we consider the combination
C(z | v, w) = rG(z | v) + (1 - r)G(z | w) defined with a weight r (0 < r < 1),
where G(z | v) denotes the density at z of a Gaussian variable with mean 0 and
variance v. The likelihood of the model is given by

$$L(a, v, w) = \prod_{i=1}^{N} \bigl[ r\,G(z(i) \mid v) + (1 - r)\,G(z(i) \mid w) \bigr].$$

Expansion of the product into the sum of the products of proper
combinations of the two terms inside the square brackets yields:

$$L(a, v, w) = \sum_{k=0}^{N} r^{k} (1 - r)^{N-k} \Bigl\{ S(I, J : k) \Bigl[ \prod_{i \in I} G(z(i) \mid v) \prod_{j \in J} G(z(j) \mid w) \Bigr] \Bigr\},$$

where S(I, J : k) denotes the summation over the possible mutually exclusive
sets I and J of i and j such that the number of elements within I is equal
to k and that of J is N - k.
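
The equality of the product form and the expanded form can be checked by brute force for a short residual series. This is only an illustrative sketch with hypothetical function names; the enumeration visits all 2^N assignments of residuals to the two Gaussian components, which is exactly the source of the computational burden when N is large.

```python
import itertools
import math

def gauss(z, var):
    """Density at z of a zero-mean Gaussian variable with variance var."""
    return math.exp(-0.5 * z * z / var) / math.sqrt(2.0 * math.pi * var)

def mixture_likelihood(z, r, v, w):
    """Product over i of [r G(z(i) | v) + (1 - r) G(z(i) | w)]."""
    out = 1.0
    for zi in z:
        out *= r * gauss(zi, v) + (1.0 - r) * gauss(zi, w)
    return out

def expanded_likelihood(z, r, v, w):
    """Sum over k and over index sets I (size k) and J (size N - k)."""
    n = len(z)
    total = 0.0
    for k in range(n + 1):
        inner = 0.0
        for chosen in itertools.combinations(range(n), k):
            in_i = set(chosen)            # indices assigned variance v
            term = 1.0
            for i, zi in enumerate(z):
                term *= gauss(zi, v if i in in_i else w)
            inner += term
        total += r ** k * (1.0 - r) ** (n - k) * inner
    return total
```

Each inner term is the likelihood of one model that fixes which observations received a wild input (the set J), so the expansion can also be read as a mixture over such models.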

This representation shows that the likelihood is composed of the 
weighted sum of the likelihoods of models, each one of which assumes a
particular sequence of locations of wild inputs. By considering these
models separately, one can develop detailed analyses of the nature of the input
process. Thus there is a possibility of performing the Bayesian analysis by
assuming a prior distribution for the distribution of wild inputs over time.
This type of modelling has been adopted for the analysis of outlying
observations by Kitagawa and Akaike (1982).

In the above representation of the likelihood the enumeration of all the 
possible locations of wild inputs quickly causes insurmountable complexity 
of the likelihood computation when N is large. If one is interested only in 
obtaining a reasonable estimate of the autoregression, then it is desirable 
to have a structure that is insensitive to the location of wild inputs. Such 
a model can be obtained by assuming a single heavy-tailed distribution for 
the error term. 

The likelihood of a model defined by assuming the Cauchy distribution

$$C(z \mid c) = \frac{1}{\pi c\,[1 + (z/c)^2]}$$

for the residuals is given by

$$L(a, c) = \prod_{i=1}^{N} C(z(i) \mid c),$$

where

$$z(i) = y(i) - \sum_{m=1}^{M} a(m)\, y(i-m).$$

One obtains estimates of these parameters by maximizing the likelihood
with respect to the (vector) parameter a(·) (= (a(1), a(2), ..., a(M))) and
c. The maximum likelihood computation can be performed by a numerical
optimization procedure.

To derive some insight into the nature of the estimator of the
autoregressive parameter a(·) we consider the structure of a single step of the
Newton-Raphson procedure for the maximization of L(a, c). We define NLL, twice
the negative log likelihood, by

$$\mathrm{NLL} = 2N \log(\pi) + 2 \sum_{i=1}^{N} \log(z(i)^2 + d) - N \log(d),$$

where $d = c^2$. The quantities required for the computation of the Hessian
and gradient are given by

$$\begin{aligned}
D(NLL/d) &= 2S[1/Q(i)] - N/d, \\
D(NLL/a(m)) &= -4S[z(i)\,y(i-m)/Q(i)], \\
DD(NLL/d, d) &= -2S[1/Q(i)^2] + N/d^2, \\
DD(NLL/a(m), d) &= 4S[z(i)\,y(i-m)/Q(i)^2], \\
DD(NLL/a(m), a(l)) &= 4S[\{y(i-m)\,y(i-l)/Q(i)\}\{1 - 2z(i)^2/Q(i)\}],
\end{aligned}$$

where S denotes the summation over i = 1, 2, ..., N, $Q(i) = z(i)^2 + d$, and
D(NLL/a(m)) and DD(NLL/a(m), a(l)) denote partial derivatives of the
first and second orders of NLL with respect to a(m), and a(m) and a(l),
respectively.
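
The first-order derivatives can be checked against finite differences of NLL. The sketch below is illustrative only and its function names are hypothetical. One assumption differs from the text: the sums run over the time points for which all M past values are available in the supplied series (conditioning on the first M observations), rather than over i = 1, ..., N with pre-sample values.

```python
import math

def residuals(y, a):
    """z(i) = y(i) - sum_m a(m) y(i - m) for every i with M past values available."""
    order = len(a)
    return [y[i] - sum(a[m] * y[i - m - 1] for m in range(order))
            for i in range(order, len(y))]

def nll(y, a, d):
    """Twice the negative log likelihood of the Cauchy AR model, with d = c^2."""
    z = residuals(y, a)
    n = len(z)
    return (2.0 * n * math.log(math.pi)
            + 2.0 * sum(math.log(zi * zi + d) for zi in z)
            - n * math.log(d))

def gradient(y, a, d):
    """Return ([D(NLL/a(m)) for m = 1..M], D(NLL/d)) from the closed forms."""
    order = len(a)
    z = residuals(y, a)
    n = len(z)
    times = range(order, len(y))          # time index i of each residual
    grad_a = [-4.0 * sum(zi * y[i - m] / (zi * zi + d) for zi, i in zip(z, times))
              for m in range(1, order + 1)]
    grad_d = 2.0 * sum(1.0 / (zi * zi + d) for zi in z) - n / d
    return grad_a, grad_d
```

Agreement between `gradient` and central differences of `nll` on an arbitrary short series is a quick safeguard before the expressions are used inside a Newton-Raphson or other optimization routine.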

Consider the situation where d is given. Then the elements of the
gradient and Hessian are given by

$$g(m) = D(NLL/a(m)) = -4S[z(i)\,y(i-m)/Q(i)]$$

and

$$H(m, l) = DD(NLL/a(m), a(l)) = 4S[\{y(i-m)\,y(i-l)/Q(i)\}\{(-z(i)^2 + d)/Q(i)\}].$$

From this one may note that when $z(i)^2$ is very small compared with d, then
the contribution from z(i) to g(m) is given approximately by
$-4z(i)\,y(i-m)/d$, and it is given by $-4\{z(i)\,y(i-m)/d\}\{d/z(i)^2\}$ when
$z(i)^2$ is very large compared with d.

Consider the contribution to the gradient from the observation with $z(i)^2$
much larger than d, which controls the scale of the Cauchy distribution. The
last result shows that this is significantly discounted compared with
$-4z(i)\,y(i-m)/d$, which approximates the contribution from the observation
with $z(i)^2$ much smaller than d.
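
The size of the discount can be made explicit by taking the ratio of the exact contribution to its small-residual approximation; the numerical case of a squared residual equal to 100 times d is an arbitrary illustration.

```latex
\frac{-4\,z(i)\,y(i-m)/Q(i)}{-4\,z(i)\,y(i-m)/d}
  = \frac{d}{z(i)^2 + d}
  = \frac{1}{1 + z(i)^2/d},
\qquad
z(i)^2 = 100\,d \;\Rightarrow\; \frac{1}{101} \approx 0.0099 .
```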

Similarly it can be seen that the contribution from the observation with
small $z(i)^2$ to the element H(m, l) is approximately equal to
$4y(i-m)\,y(i-l)/d$. This result and the former result for the gradient show
that the contribution from observations with small $z(i)^2$ to the one-step
correction by the Newton-Raphson procedure is similar to that under the
assumption of the Gaussian distribution, while the contribution corresponding
to an observation with large $z(i)^2$ is discounted.

The present observation suggests that the maximum likelihood estimate
of a(·) will be insensitive to an input z(i) with large amplitude, i.e., that
the estimates of the autoregressive coefficients will show some kind of
robustness against a wild input, and that they will approximate the Gaussian
estimates when wild inputs are absent.

The result of a numerical experiment with the present procedure is il- 
lustrated in Figure 4. The theoretical transfer function is defined by 

$$A(f) = \Bigl( 1 - \sum_{m=1}^{M} a(m) \exp(-i 2\pi m f) \Bigr)^{-1}$$

and the figures are on the scale of 20 log |A(f)|. The estimates by the fourth
order AR models were obtained by applying the minimum AIC procedure 
to the models up to order M = 10, fitted to the first 300 observations. The 
observations were generated from the stationary Gaussian process defined 
by the frequency response function and the innovation process z{i) which is 
Gaussian with mean 0 and variance appropriate to simulate a real record of 
an earthquake. The simulated record is shown in Figure 5, below the actual 
seismic record. 
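
The scale used in the figures can be computed directly from a set of AR coefficients. This is a sketch under two assumptions that are not stated explicitly in the text as extracted: that A(f) is the reciprocal of 1 minus the sum of a(m) exp(-i 2π m f), the usual transfer function of an autoregression, and that the logarithm is to base 10. The function name is hypothetical.

```python
import cmath
import math

def transfer_gain_db(coeffs, f):
    """20 log10 |A(f)| with A(f) = 1 / (1 - sum_m a(m) exp(-i 2 pi m f)).

    coeffs[0] is a(1), coeffs[1] is a(2), and so on; f is in cycles per sample.
    """
    denom = 1.0 - sum(a * cmath.exp(-2j * math.pi * m * f)
                      for m, a in enumerate(coeffs, start=1))
    return 20.0 * math.log10(1.0 / abs(denom))
```

For a first order model with a(1) = 0.5 the gain is 20 log10(2) at f = 0 and 20 log10(1/1.5) at f = 0.5, so the function reproduces the familiar low-pass shape of a positively correlated AR(1) process.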

The lower two estimates were obtained using the 340 observations with 
the additional last 40 input values z{i) generated from a zero mean Gaussian 
distribution with standard deviation five times that of the original. To see 
the effect of these abnormal inputs the coefficients of autoregression a(m) 
were replaced by $(0.9)^m a(m)$ to generate the observations y(i) for the span
of i that corresponded to the span of wild inputs.

It can be seen from the estimates that the effect of the abnormal input 
is quite significant for the Gaussian estimate, while the Cauchy estimate 
is rather insensitive. However, the Cauchy estimates show a systematic 
deviation or bias from the theoretical transfer function with lower values in 
the very low frequency range. 

The likelihood of the Gaussian model is higher than that of the Cauchy
model for the first 300 observations, and lower for the full 340
observations. Taking into account the fact that the Cauchy model is
never a real representative of the generating process of the data, this result 
demonstrates the use of the expected log likelihood as the basic criterion of 
fit of instrumental models with different distributional characteristics. The 
fact that the log likelihood can provide a reasonable criterion of fit of a false 
“model” is based on the fact that its expectation forms an “objective” cri- 
terion of fit of a model to the true distribution of the data, whatever this 
latter might be; see, for example, Akaike (1985). 
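
The comparison of maximized log likelihoods across distributional families can be sketched on a fixed residual series. This is a simplified illustration, not the computation behind Figure 4 or Figure 5: the AR coefficients are taken as already fitted, so only the scale parameter of each error distribution is estimated, and the function names are hypothetical. The Cauchy scale is found by bisection on the condition D(NLL/d) = 0, assuming fewer than half of the residuals are exactly zero so that a root exists.

```python
import math

def gaussian_max_loglik(z):
    """Maximized log likelihood of zero-mean Gaussian errors (variance estimated)."""
    n = len(z)
    v = sum(x * x for x in z) / n
    return -0.5 * n * (math.log(2.0 * math.pi * v) + 1.0)

def cauchy_max_loglik(z):
    """Maximized log likelihood of Cauchy errors with scale c, where d = c^2."""
    n = len(z)

    def score(d):
        # Same sign as D(NLL/d) = 2 S[1/Q(i)] - N/d, multiplied by d > 0.
        return 2.0 * sum(d / (x * x + d) for x in z) - n

    lo, hi = 1e-12, 1e12
    for _ in range(200):
        mid = math.sqrt(lo * hi)          # bisection on a logarithmic scale
        if score(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    d = math.sqrt(lo * hi)
    return sum(0.5 * math.log(d) - math.log(math.pi) - math.log(x * x + d)
               for x in z)

def aic(max_loglik, n_params=1):
    return -2.0 * max_loglik + 2.0 * n_params
```

Since each model has a single scale parameter here, comparing the AIC values is equivalent to comparing the maximized log likelihoods; the Gaussian model is preferred on well-behaved residuals, and the Cauchy model once a few very large residuals are present.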

The present model has been applied to an actual earthquake record 
which is shown in Figure 5. The maximum likelihood of the Gaussian model 
is higher for the span of the data where only the effect of the stationary 
microtremor is recorded, while that of the Cauchy model is higher for the 
total span of data where the arrival of the P-wave of the earthquake is
observed. This demonstrates the possibility of the combined use of the two
models for the detection of abnormal inputs.

Figure 4. Comparison of the robustness of the AR estimates based on the
assumption of the Gaussian and Cauchy distributions of error terms.
(Panel title: THEORETICAL TRANSFER FUNCTION.)



6. CONCLUDING REMARKS 

In this paper, emphasis has been placed on the instrumental point of 
view, which considers a statistical model as an artificial structure for the 
development of useful data processing procedures. The numerical results 
demonstrate the potential of this point of view for the advancement of mod- 
elling of time series. 

Figure 5. Performance of AIC in detecting the structural change of a time
series. Real and simulated examples.
(Panel labels: SIMULATED RECORD; GAUSS AR 4 815.5; GAUSS AR 9 14D7.9;
CAUCHY AR 4 919.5; CAUCHY AR 8 1224.8.)

The most serious problem with the application of this point of view is
how to evaluate the goodness of a model. The discussion of the predictive
performance of BAYSEA models has shown that it is only after detailed 
comparison of various possible models that we can develop sufficient confi- 
dence in applying a model or procedure for the analysis of a particular set 
of data. In such a situation, as the number of possible models increases, it 
becomes impossible to reach a reasonable decision within a limited amount 
of time without a proper criterion of fit of a model. The use of a large num- 
ber of models is becoming the practice rather than the exception with the 
development of efficient computing procedures. 

The instrumental point of view introduces much freedom into Bayesian 
modelling. The example of the LOCCAR procedure suggests the practical
utility of a procedure which can formally be viewed as the Bayes procedure 
defined with a prior distribution that depends on data. 

The concept of the expected log likelihood, of which estimates can be 
obtained from the NLL statistic, does play a vital role in developing criteria 
such as AIC and ABIC. The use of such a criterion becomes mandatory 
when the comparison is concerned with models with different distributional 
properties, as in the case of the comparison of the Gaussian and Cauchy AR 
models. 

The importance of developing new structural models based on knowledge 
of a particular area of application should never be overlooked. Nevertheless, 
it is hoped that the discussion in this paper has demonstrated the potential 
of the instrumental point of view, when it is equipped with a criterion of fit. 
This approach opens up the possibility of developing a proper use of “false”
models representing various ways of looking at statistical data, including 
time series. 
MODEL SELECTION AND FORECASTING:

A SEMI-AUTOMATIC APPROACH 

ABSTRACT 

Econometric or time-series forecasts in the telecommunications industry 
(e.g., service demand, revenue) are an important element in a number of 
decision making processes; i.e., staffing, budgeting, tariff setting and other 
marketing strategies. Since these forecasts are only one of a number of inputs 
to decision making, no optimality criterion can be defined. The absence of 
an optimality criterion and the large number of series involved makes the 
selection of models an even more difficult exercise. Usually, the selection 
process is subject to two validation procedures: first, statistical tests on 
historical data to ensure inclusion of meaningful explanatory variables and 
proper fit, and second, tests of the model’s ability to allow the evaluation 
of the impact of future (hypothetical) market conditions and/or internal or 
external (e.g., Government) policies.

In this paper, a two-stage ‘semi-automatic’ selection criterion, which 
produces a subset of feasible and ‘ranked’ models using an internal validation
procedure is suggested. The criterion used in the first stage is based on 
the performance of competing models in predicting observed data points 
(forward validation); the selected subset is then validated at the second 
stage through subjective assessments (scenario validation). 



1. INTRODUCTION 

In this paper some problems are addressed which are faced by many time 
series modellers and forecasters in selecting the ‘best’ model among a set of 
competitors using model selection criteria such as the AIC-type criteria. It is
argued here that in most practical situations, a single model is not sufficient 
for the purpose of analysis, and hence a set containing several useful models 
should be considered. The necessity of considering a collection of models
arises in the case of a non-stationary time series; for example, one with
gradually changing characteristics (Akaike, 1979a). Also, results of some 
simulations (MacNeill and Duong, 1982) have shown that wrong models 
could be chosen by some of these model selection criteria when the time 
series undergoes even modest changes in its structure (parameter or order 
changes). Similar difficulties arise when one attempts to model short series, 
in which case the selection statistic may not be reliable due to small sample 
variability. An attractive solution for this problem has been discussed by 
Kitagawa and Gersch (1985a,b), who suggested using smoothness priors in 
the form of constraints on the AR model parameters. A similar approach 
has been suggested by Duong (1981). 

In the context of forecasting, the problem of choosing among several 
available forecasting models has also received much attention lately. Makri- 
dakis et al. (1982) evaluated the accuracy of 24 major forecasting methods 
on a collection of 1,001 time series. A comprehensive classification of the 
more commonly used methods and their accuracy is described by Mahmoud 
(1984). Given the current proliferation of forecasting methods, a decision- 
maker will often be presented with a set of k forecasts, or more generally, a 
set of k mathematical models which could generate forecasts. The question 
facing him is the following: should he attempt to select the ‘best’ forecast, or 
should he try to combine the k forecasts in some way? Both alternatives (se- 
lection versus synthesis) require optimality criteria, which in practice might 
be difficult to define and evaluate. 

For the two situations in hand, model selection via a minimum-selection 
criterion or choice of a forecasting method (which one may or may not be 
able to characterize by a model selection criterion), we suggest a two-stage, 
‘semi-automatic’ approach, which produces a subset of feasible and ‘ranked’ 
models. In the first stage, the performance of the competing models in pre- 
dicting observed data points is used to rank the models; a subjective selec- 
tion procedure is then used in the second stage to validate and, if necessary, 
combine the ‘good’ models. While the first stage is essentially automatic 
in nature, except for the specification of the degree of uncertainty we are 
willing to tolerate, the second stage requires subjective assessments of the 
various scenarios represented by the selected model(s).

In Section 2, a method for ranking models based on model selection cri- 
teria is discussed. Section 3 discusses the use of forecasting performances as 
ranking measures in the more general situation when nonparametric and/or 
judgmental forecasting techniques are considered. The proposed approach 
is then applied in Section 4 to an econometric model from the telecommu- 
nications industry. 
2. MODEL SELECTION CRITERIA

The selection of models in regression analysis and time series analysis 
has been discussed extensively in the statistical literature during the last 
decade or so. For the case of regression analysis, measures such as Mallows’ 
Cp (see Mallows, 1973) provide a method for assessing the adequacy of a 
p-predictor model. For time series data, Akaike’s FPE (Final Prediction 
Error) was suggested and then generalized to the celebrated AIC (Akaike
Information Criterion), defined as follows:

$$\mathrm{AIC} = -2 \log(\text{maximum likelihood}) + 2\,(\text{number of parameters}).$$

In a nutshell, AIC is a measure of the goodness-of-fit of the model, 
either in terms of the Kullback-Leibler information measure, or in terms of 
the expected log-likelihood ratio. Similar criteria (BIC, Schwarz’s criterion, 
Hannan and Quinn’s criterion, Parzen’s CAT, etc.) were later developed, using
different approaches. These model selection criteria all represent a marked 
departure from the more traditional hypothesis-testing framework which has 
been found to be quite satisfactory in large sample situations, and when the 
tested models are nested. In the latter situation, it could also be shown that 
there is a connection between AIC and an F-test (see Soderstrom, 1977).
Although in principle, non-nested hypotheses could be entertained, several
theoretical difficulties remain to be solved, not the least of which is the choice
of a significance level. This choice becomes irrelevant with AIC-type model
selection criteria. Furthermore some of these criteria can be given a Bayesian 
interpretation, with an arbitrary prior probability over the set of competing 
models, thus adding to their popularity (see Akaike, 1979b). 

Notwithstanding their many merits, which include ease of computation
and usefulness in comparing among non-nested and non-linear models, AIC- 
type model selection criteria cannot avoid the small-sample problem inherent 
in all estimation procedures based on the maximum-likelihood principle. 
More fundamentally, one might rightly question to what extent the term 
‘best’ is well defined in certain situations. A possible answer is provided 
through consideration of the concept of “subset of models”, very much like
that of “confidence interval” for point estimation. Duong (1984) suggested 
the ranking and selection approach to this problem. The main thrust of 
this approach is the use of specific procedures for determining which models 
should be retained for further consideration, in order to have some control 
over the degree of confidence that the correct model will be included. In 
essence, the models will be ranked as usual with respect to some computed 
model selection criterion, and all models ‘close’ enough to the best Minimum-AIC
model will be considered as valid competitors and will be included
for further consideration in the subset of selected models. To carry out
this procedure, one needs to specify beforehand the desired probability of 
correct selection, that is, the probability that the true model is included. For 
AR models, it could be shown that the size of the selected subset depends 
strongly on the difference between the maximum and minimum orders among 
the competing models. This point is important in many practical situations, 
for example, when costs are directly linked to model complexity. 

As already pointed out in the introduction, there could be a host of sit- 
uations where we do not have enough confidence in either the data (short 
data span, inaccuracies, etc.) or even the class of models being considered 
(e.g., it is only an approximation to some underlying law) to feel comfortable 
with one single ‘best’ model; that is, we feel that the available evidence is 
not sufficient to discriminate among the competing models. It is then neces- 
sary to weigh this evidence against our degree of confidence. Although the 
ranking and subset selection approach to be presented below is not without 
some disadvantages, especially in terms of computing the threshold value, 
we believe that it is the most natural one for this situation; the proposed 
procedure explicitly views the modelling exercise as an attempt to rank the 
competing models by how close they seem to fit the data, and hence, how 
close these are to the true model. In summary, the procedure explicitly 
contains the two main ingredients of any real decision-making situation: 

(i) a selection strategy with the aim of choosing the ‘best’ alternative, and 

(ii) a method for incorporating our subjective assessment of the data and/or 
requirements of the study through specification of the probability of 
correct selection. 

The suggested procedure is now briefly described. For more details, see 
Duong (1984). 

Assume that one has a series of length T, and consider the problem of 
selecting the “best” AR model among a set of competing models of order 
0, 1, 2, ..., K, using AIC as the selection criterion. The models are chosen
as follows:

(R) : Retain model k (autoregression of order k) in the selected subset if
and only if

\[ \mathrm{AIC}(k) - c < \min_{0 \le j \le K} \mathrm{AIC}(j) \qquad (k = 0, 1, \ldots, K), \]

where $\mathrm{AIC}(k) = T \log \hat\sigma^2(k) + 2k$, and $c > 0$ is a constant to be
determined. $\hat\sigma^2(k)$ is the M.L.E. of the residual variance for the kth
order AR model.
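
As a minimal sketch of rule (R), the following Python fragment computes AIC(k) from residual-variance estimates and returns the retained orders. The function names `aic` and `retain_models` and the numerical values in the usage lines are illustrative assumptions, not part of the procedure as published; the threshold c is simply taken as given.

```python
import math


def aic(T, sigma2_hat, k):
    """AIC(k) = T * log(sigma2_hat) + 2k for an AR(k) model fitted to T points."""
    return T * math.log(sigma2_hat) + 2 * k


def retain_models(aic_values, c):
    """Rule (R): keep order k if and only if AIC(k) - c < min_j AIC(j).

    aic_values[k] holds AIC(k) for k = 0, 1, ..., K; c > 0 is the threshold.
    """
    best = min(aic_values)
    return [k for k, value in enumerate(aic_values) if value - c < best]


# Hypothetical residual variances for AR(0), ..., AR(3) fitted to T = 100 points.
T = 100
sigma2 = [2.00, 1.20, 1.19, 1.185]
aic_values = [aic(T, s2, k) for k, s2 in enumerate(sigma2)]
subset = retain_models(aic_values, c=2.0)
print(subset)
```

With these made-up numbers the minimum-AIC order is 1, and order 2 is also retained because its AIC lies within c of the minimum.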

The selection is said to be a correct selection (CS) if the true model of
order $k_0$ is retained in the selected subset. Let pr{CS; R} be the probability
of correct selection under rule R. Then

\[
\begin{aligned}
\mathrm{pr}\{CS; R\} &= \mathrm{pr}\{\mathrm{AIC}(k_0) - c < \min_{0 \le j \le K} \mathrm{AIC}(j)\} \\
&= \mathrm{pr}\{\mathrm{AIC}(k_0) - c < \mathrm{AIC}(k)\ (k = 0, 1, \ldots, K,\ k \ne k_0)\} \\
&= \mathrm{pr}\{(\mathrm{AIC}(k_0) - c < \mathrm{AIC}(k),\ K \ge k > k_0) \text{ and } \\
&\qquad\ \ (\mathrm{AIC}(k_0) - c < \mathrm{AIC}(k),\ 0 \le k < k_0)\}.
\end{aligned}
\]

Let $A = \{\mathrm{AIC}(k_0) - c < \mathrm{AIC}(k),\ K \ge k > k_0\}$. Then an argument similar
to Shibata’s (1976) can be used to show that for $c > 0$ fixed as $T \to \infty$, we
have

\[ \mathrm{pr}\{\mathrm{AIC}(k_0) - c < \mathrm{AIC}(k),\ 0 \le k < k_0 \mid A\} \to 1. \]

Hence, for T sufficiently large,

\[ \mathrm{pr}\{CS; R\} = \mathrm{pr}\{\mathrm{AIC}(k_0) - c < \mathrm{AIC}(k),\ k_0 < k \le K\}. \]

Note that by replacing 2 in the definition of AIC by log T, one can get
pr{CS; R} = 1.
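
The large-sample expression above can be explored numerically. The sketch below rests on an assumption that is not spelled out in the text: that for k > k0 the statistic T log{σ̂²(k0)/σ̂²(k)} behaves, for large T, like a sum of k − k0 independent chi-square variables with one degree of freedom each, which is the kind of approximation used in Shibata-type arguments. Under that assumption the event AIC(k0) − c < AIC(k) for all k0 < k ≤ K becomes S_l < c for l = 1, ..., K − k0, where S_l is a random walk with increments Z² − 2. The function name and simulation settings are hypothetical.

```python
import random


def prob_correct_selection(c, extra_orders, n_sim=20000, seed=0):
    """Monte Carlo estimate of pr{S_l < c for l = 1, ..., extra_orders}.

    S_l is the sum of l independent terms Z**2 - 2 with Z ~ N(0, 1);
    extra_orders plays the role of K - k0.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        walk = 0.0
        inside = True
        for _ in range(extra_orders):
            walk += rng.gauss(0.0, 1.0) ** 2 - 2.0
            if walk >= c:
                inside = False
                break
        hits += inside
    return hits / n_sim


for c in (0.5, 2.0, 5.0):
    print(c, prob_correct_selection(c, extra_orders=5))
```

The estimate increases with c and decreases as more over-fitting orders are allowed, which is one way to see why the size of the retained subset depends on the spread of candidate orders.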

This procedure could, of course, be generalized to ARIMA models, and 
for other model selection criteria. Note that in terms of FPE, this procedure 
could be modified as follows: Retain model k in the selected subset if and 
only if 

FPE{k) < 

As the value c would be difficult to compute in most cases, more experience 
with this procedure is required. Some simulations have been carried out and 
the results discussed by Duong (1984). 

Once the subset of ‘good’ models has been formed, some refinements 
could be carried out to now rank the models in the subset on a goodness- 
of-prediction test (if forecasting is the primary goal) and to combine these 
models in some optimal way. Of course, a model could be chosen for reasons 
other than forecasting, such as spectral density estimation. The goodness- 
of-fit criterion must hence be tailored to the purpose of the analysis. We 
see this as a necessary validation step, which could be carried out simultaneously
with the selection step through the minimization of some loss structure
which trades off the predictive ability of the model against its goodness-of- 
fit. A simple loss function could be derived by weighting the model selection 
criterion and the predictive performance of the model: 

\[ L(\text{model } k) = \beta\,\mathrm{AIC}(k) + (1 - \beta) \sum_{i \notin J} (\hat y_i(k) - y_i)^2, \]

where $\beta$ is a weight between 0 and 1, J denotes the “training” data set used
to fit the models, and $\hat y_i(k)$
is the predicted value from model k. In Section 4, an example is used to
illustrate how the “training” data could be chosen in some reasonable way. In 
general, predictions are much more model dependent than data fitting, and 
hence it is possible to choose a suitably small “training” data set to which the 
competing models are fit; this will facilitate the choice of a preferable model, 
since, hopefully, we would then have a large predictive data set to validate 
the selected models. A similar approach was taken by Hjorth (1982) using 
a forward validation procedure in connection with model selection criteria. 
Ribaric (1984) applied the same approach to the more general problem of 
parsimonious data fitting. 
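
A minimal sketch of such a weighted loss, assuming it takes the form β·AIC(k) plus (1 − β) times the sum of squared prediction errors over the points outside the training set. The function name `combined_loss`, the weight β and all numerical values are illustrative assumptions.

```python
def combined_loss(aic_k, predictions, actuals, beta):
    """beta * AIC(k) + (1 - beta) * sum of squared prediction errors.

    predictions and actuals cover the hold-out points only (those outside
    the training set J); beta in [0, 1] is a user-chosen weight.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    if len(predictions) != len(actuals):
        raise ValueError("predictions and actuals must have equal length")
    sse = sum((p - a) ** 2 for p, a in zip(predictions, actuals))
    return beta * aic_k + (1.0 - beta) * sse


# Hypothetical comparison of two retained models on the same hold-out data.
actuals = [10.0, 11.0, 12.0]
loss_a = combined_loss(20.0, [10.5, 11.5, 12.5], actuals, beta=0.5)
loss_b = combined_loss(21.0, [10.1, 11.0, 12.1], actuals, beta=0.5)
print(loss_a, loss_b)
```

Setting β = 1 recovers pure AIC ranking and β = 0 ranks on predictive performance alone; since AIC and squared errors are on different scales, the choice of β is a judgment call.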

More often than not, forecasts are obtained from nonparametric techniques
or are judgmental in nature, in which case the procedure (R) above
must be modified. In the next section, a slightly modified procedure is given 
and illustrated. The second stage of the procedure, that is, how one goes 
about assessing the worth of the selected models and forecasts, is also dis- 
cussed. 



3. RANGE FORECAST 

In many situations, several forecasts have been made of the same event. 
They result either from the use of different forecasting techniques or from 
opinions of different individuals. As pointed out by Bates and Granger 
(1969), the forecasts which are usually discarded by the decision-maker fol- 
lowing the selection of the “best” one, may contain useful independent in- 
formation. It seems then reasonable to combine the available forecasts and 
assign the weights in such a way that the composite forecast would have a 
lower mean square error of prediction than each of its components. Bates and 
Granger (1969) and Dickinson (1973, 1975), among others, have suggested 
and studied alternative ways of computing these weights. This problem is 
very similar to that of selecting the shrinkage factor in James-Stein-type 
estimators, or in the time series context, the signal-to-noise variance ratios. 
Duong (1981) has given a detailed discussion on these more general prob- 
lems. 
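
For two unbiased forecasts, the variance-minimizing combination referred to above has a simple closed form. The sketch below implements that standard two-forecast formula; the function names and the numerical values are illustrative assumptions, and in practice the error variances and correlation would have to be estimated from past forecast errors.

```python
import math


def optimal_weight(var1, var2, rho=0.0):
    """Weight w on forecast 1 minimizing the variance of w*e1 + (1 - w)*e2,
    where e1, e2 are forecast errors with variances var1, var2 and
    correlation rho."""
    cov = rho * math.sqrt(var1 * var2)
    denom = var1 + var2 - 2.0 * cov
    if denom <= 0.0:
        raise ValueError("no unique minimizing weight for these inputs")
    return (var2 - cov) / denom


def combined_variance(w, var1, var2, rho=0.0):
    """Error variance of the combined forecast w*f1 + (1 - w)*f2."""
    cov = rho * math.sqrt(var1 * var2)
    return w * w * var1 + (1.0 - w) ** 2 * var2 + 2.0 * w * (1.0 - w) * cov


# Hypothetical error variances: forecast 1 is noisier than forecast 2.
w = optimal_weight(4.0, 1.0)
print(w, combined_variance(w, 4.0, 1.0))
```

In this made-up case the noisier forecast still receives some weight, and the combined error variance falls below that of the better individual forecast.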

The optimal weights will depend on unknown quantities (in this case, the 
forecast error variances and their correlations), which have to be estimated
from the observed data. Combining forecasts and, in general, estimating 
by the James-Stein procedure could be considered as methods for estimat- 
ing parameters and pooling information from different sources. Bayesian 
interpretations of these methods have been extensively discussed in the last 
decade. See Bunn (1975) for a discussion of the forecasting problem. A natural
step to consider for certain types of problems is the use of ranking and
subset selection procedures, especially when the classical hypothesis testing
approach is not appropriate. Generally known as multiple decision procedures
(e.g., Gibbons et al., 1977), this new research area has been developing
rapidly, with applications in almost every field of statistics. The procedure
(R) described in the previous section is, to our knowledge, the first use of
this approach in model selection problems.

We now adapt the procedure (R) for the selection of a subset of forecasts
from a potentially large set of competing forecasts. An example is then given 
to illustrate the procedure. 

Assume that there are k forecasting techniques in competition. Also 
assume that they are unbiased, in the sense that they do not consistently 
overestimate (or underestimate) the true value. We also impose the following 
simplifying restrictions: 

(i) There is no correlation between forecast errors. 

(ii) The number of forecast values is the same (m) for all k techniques. 

As one may suspect, these two restrictions are due to computational
difficulties, more than for any theoretical reason. Let $e_{ir}$ ($i = 1, 2, \ldots, k$;
$r = 1, 2, \ldots, m$) denote the error in the rth forecast value using the ith forecasting
technique, and let $\sigma_i^2$ ($i = 1, 2, \ldots, k$) denote the (unknown) error variance of
the ith forecasting technique. Let

\[ \sigma_{[1]}^2 \le \sigma_{[2]}^2 \le \cdots \le \sigma_{[k]}^2 \]

denote the (unknown) ordered values. For simplicity, assume that
$\sigma_{[1]}^2 < \sigma_{[2]}^2$, so that the following problem is well defined:

Based on sample estimates $\hat\sigma_i^2$, $i = 1, 2, \ldots, k$, identify
the best forecasting technique, that is, the one with the smallest error variance.

The goal here is to divide the set of k forecasting techniques into two 
distinct subsets in such a way that one, the selected subset, contains the best 
technique with a high probability, at least P*, a pre-specified value, and the 
other one does not. The selection rule is simply: 

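(R) For each i = 1, 2, ..., k, retain the ith forecasting technique in the
selected subset if and only if the sample estimate $\hat\sigma_i^2$ of its error
variance lies in I, where

\[ I = [\hat\sigma_{[1]}^2,\ \hat\sigma_{[1]}^2 / c], \]

$\hat\sigma_{[1]}^2$ is the smallest of the k sample estimates, and where $0 < c < 1$
is a constant to be determined.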

The selected subset is never empty, but one usually prefers not to include too
many competitors, subject to the condition that the probability of correct
selection, that is, including the forecast with the smallest error variance, is
at least P*. The values of c will depend on P*, m and k.

By following the same technique as used by Gupta and Sobel (1962a,b),
and denoting by pr{CS; R} the probability of correct selection under rule
R, it can be shown that

\[ \inf_{\Omega} \mathrm{pr}\{CS; R\} = \mathrm{pr}\left\{ \frac{\min_{j = 1, 2, \ldots, p} \chi_j^2(m)}{\chi_0^2(m)} \ge c \right\}, \]

where $\Omega$ is the parameter space $\{\sigma_i^2 > 0\ (i = 1, 2, \ldots, k)\}$ and where
$\chi_j^2(m)$ ($j = 0, 1, 2, \ldots, p$; $p = k - 1$) denote k independent chi-square random
variables with a common number of degrees of freedom m. Hence, the determination
of c such that $\mathrm{pr}\{CS; R\} \ge P^*$ is equivalent to the determination
of a lower percentage point of the Studentized smallest chi-square statistic
with m degrees of freedom for all k chi-squares. Gupta and Sobel (1962b)
have constructed tables for the largest c-values satisfying the requirement
$\mathrm{pr}\{CS; R\} \ge P^*$. Note that in the case when m is an even integer, the
result can be obtained in the form of a finite series.
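
When the tables are not at hand, the probability on the right-hand side can be checked by simulation. The following sketch estimates pr{min_j χ²_j(m)/χ²_0(m) ≥ c} by Monte Carlo, using only the Python standard library; the function name, the number of replications and the seed are illustrative assumptions, and a simulation of this kind only approximates the tabulated values.

```python
import random


def min_prob_correct_selection(c, k, m, n_sim=50000, seed=0):
    """Monte Carlo estimate of pr{ min_{j>=1} chi2_j(m) / chi2_0(m) >= c }
    for k independent chi-square variables with m degrees of freedom."""
    if k < 2:
        raise ValueError("at least two competing techniques are required")
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        # A chi-square(m) variable is a gamma variable with shape m/2, scale 2.
        chi2 = [rng.gammavariate(m / 2.0, 2.0) for _ in range(k)]
        if min(chi2[1:]) >= c * chi2[0]:
            hits += 1
    return hits / n_sim


# c-values quoted in the example below for k = 2, m = 8.
for p_star, c in [(0.90, 0.3862), (0.95, 0.2909), (0.99, 0.1659)]:
    print(p_star, min_prob_correct_selection(c, k=2, m=8))
```

For k = 2 the ratio is an F variable with (m, m) degrees of freedom, so the estimates should fall close to the nominal P* values.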



An example 

The following table (Table 1) is taken from Bates and Granger (1969) 
and gives the actual values and forecasts, using two different methods, of 
the 1948-1965 U.K. output indices for the gas, electricity and water sector, 
as published in National Income and Expenditure, 1966. It is obvious here 
that the exponential forecast procedure yields a better result than does the 
linear forecast procedure. It is certainly not clear whether the latter should 
be considered. Using the ranking approach described above, one would use 
the linear forecast if 

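\[ \sum_{r} e_{1r}^2 \le \frac{1}{c} \sum_{r} e_{2r}^2, \]

where $e_{1r}$ and $e_{2r}$ denote the errors of the linear and exponential
forecasts, respectively.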

Considering only the latest forecasts (1958-1965), the c-values are determined
from Table 3 of Gupta and Sobel (1962b) for k = 2, m = 8, at
different P* values as follows:

P* = 0.90, c = 0.3862 

P* = 0.95, c = 0.2909 

P* = 0.99, c = 0.1659 

Table 1. Individual Forecasts of Output Indices for the
Gas, Electricity and Water Sector in U.K.

        Actual index    Linear      Exponential
Year    (1958 = 100)    forecast    forecast

1948         58
1949         62
1950         67            66.0         66.3
1951         72            71.3         71.9
1952         74            76.5         77.4
1953         77            79.2         80.3
1954         84            81.9         83.2
1955         88            89.0         88.6
1956         92            91.6         93.7
1957         96            96.0         98.5
1958        100           100.2        103.2
1959        103           104.3        107.8
1960        110           108.1        112.1
1961        116           112.9        117.4
1962        125           118.0        123.3
1963        133           124.2        130.2
1964        137           130.9        137.8
1965        145           137.0        145.0

Sum of squared errors     263.2         84.2

Since $\sum_{r=1}^{8} e_{2r}^2 = 51.02$ and $\sum_{r=1}^{8} e_{1r}^2 = 242.60$, we could reach the
following conclusions:

(i) Since $\sum_{r=1}^{8} e_{2r}^2 / 0.3862 = 132.107$, the linear forecast is excluded;

(ii) Since $\sum_{r=1}^{8} e_{2r}^2 / 0.2909 = 175.38$, the linear forecast is excluded;

(iii) Since $\sum_{r=1}^{8} e_{2r}^2 / 0.1659 = 307.53$, the linear forecast is included.

Hence the linear forecast should only be considered as a valid competitor if 
one imposes a high probability (0.99) of correct selection. Dickinson (1973) 
also questioned the use of the linear forecast in this case. As already mentioned,
the idea behind the multiple decision procedures approach is fundamentally
different from that for hypothesis testing or estimation. The goal
of the experimenter here is to rank his forecasts, assuming from the outset
that they are unequal. In situations where one is interested less in the
forecasted values than in the problem of assessing the “goodness” of different
competitive methods, which may result from differences in assumptions,
theories, expertise, types of information used, etc., this formulation may be
more realistic.
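
The arithmetic of this example can be retraced with a short script. The sketch below uses the 1958-1965 rows of Table 1 and the three c-values quoted above; because both methods have the same number of forecasts, comparing sums of squared errors is equivalent to comparing the variance estimates. The function and variable names are illustrative assumptions.

```python
# Actual index and forecasts for 1958-1965, transcribed from Table 1.
actual = [100, 103, 110, 116, 125, 133, 137, 145]
linear = [100.2, 104.3, 108.1, 112.9, 118.0, 124.2, 130.9, 137.0]
exponential = [103.2, 107.8, 112.1, 117.4, 123.3, 130.2, 137.8, 145.0]


def sse(forecast, observed):
    """Sum of squared forecast errors."""
    return sum((f - y) ** 2 for f, y in zip(forecast, observed))


def selected_subset(sse_by_method, c):
    """Retain a method iff its SSE is at most (smallest SSE) / c."""
    smallest = min(sse_by_method.values())
    return sorted(name for name, s in sse_by_method.items() if s <= smallest / c)


sse_by_method = {
    "linear": sse(linear, actual),
    "exponential": sse(exponential, actual),
}
for p_star, c in [(0.90, 0.3862), (0.95, 0.2909), (0.99, 0.1659)]:
    print(p_star, selected_subset(sse_by_method, c))
```

The linear forecast enters the selected subset only at the smallest c-value, that is, at P* = 0.99.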

Recent work by Makridakis et al. (1982) has focused on various ways
of evaluating the accuracy of forecasting methods, including those resulting 
from combining known methods. Although accuracy is certainly an impor- 
tant feature for short-range forecasts, it is almost a futile task to try to 
improve the accuracy of long-range forecasts. By long-range forecasting, 
one means forecasting over a time horizon where “large changes in the en- 
vironment may be expected to occur” (Armstrong, 1978). This is certainly 
the case for many macro-economic forecasts, whereby economic scenarios 
are used as input to produce a set of long-range forecasts, each reflecting a 
different view about future economic situations. 

Chen and Rung (1984) used the term “range forecasting” to refer to the 
range of multiple hypothesized scenarios, as compared to the width of some 
error band (e.g., variance) for a single-valued forecast. For an econometric
model, “range forecasting” could be carried out by either of two
methods:

(a) A range of forecasts is obtained by having different input scenarios for 
the independent variables; the model coefficients are, however, estimated
only once from the available data. 

(b) The historical data is divided into k (possibly overlapping) periods which 
are thought to represent the k scenarios being entertained. The model 
is re-evaluated in each case to produce corresponding forecasts. 

Figure 1 depicts a typical situation with various economic scenarios, such 
as expansion, downturn, significant technological improvements, and price 
changes. 

We have extended the use of range-forecasting to include, in addition to 
econometric models, time series models (ARIMA models, non-linear models, 
non-stationary autoregressions to model trends for “Long-Memory” series, 
etc.) and smoothing techniques. Among the latter methods, the recently 
developed class of generalized Holt exponential smoothing methods with 
autoregressive-damping (AD) parameters to model linear trends could also 
be considered as an attempt to incorporate scenarios into the forecasting pro- 
cess. However, this class was intended to be part of an automatic forecasting 
system, with the AD parameter chosen on the basis of a priori information 
about the behaviour of the future trend. For example, the choice of an AD
parameter between 0 and 1 would correspond to the trend being damped
out at rates according to the choice of parameter (Gardner and McKenzie,
1985). Given the sensitivity of forecasts to the choice of the AD parameter,
it is natural to consider a subset of forecasts to reflect this sensitivity.

Figure 1. Economic scenarios from historical data.

Posing similar challenges are forecasts of a volatile series such as Special 
Services Circuits (for example, Voice and Data services offered by a tele- 
phone company other than ordinary message telephone service, such as for- 
eign exchanges, tie lines, and off-premise extensions). A five-year forecast is 
generally required for various planning operations. Due to the large number 
of series to be forecast and to the high degree of volatility of special services 
(see Figure 2 for an example), a time-dependent method using Kalman-filter
techniques was applied. However, forecasts were “smoothed” with smoothing
parameter α. The computational details are given by Duong (1983). For
various values of the parameter, which reflect subjective expectations about
service growth, comparisons with the manual (or “current view”) forecasts
were made on a randomly selected set of 100 series each from both Voice 
and Data services. 

Figure 2. Special Services Forecasts (monthly data, 1981-1983). Triangles
represent data points (last 5 points are forecasts); circles are Kalman-filter
estimates.

It should be noted that the five-year forecasts are reviewed approximately
every year, and revisions result from the evaluation input information
given by groups such as Marketing, Sales, and Rates. For illustrative
purposes, we have treated the current forecasts as monthly forecasts (for the
next 5 months). Since underprovisioning in this case has different implications
than overforecasting the service demands, the following loss measure
was used as an aid to the choice of “good” forecasts, that is, those close to
the current view:

\[ \mathrm{DAVEG} = \frac{1}{m} \sum_{\text{over sampled series}} \Big( \frac{1}{5} \sum_{\text{over 5 years}} (\mathrm{KFF} - \mathrm{CV}) \Big), \]

where DAVEG is the average difference, KFF is the Kalman-filter forecast,
and CV is the current view. Table 2 summarizes the results. With the
interpretation of this loss measure as described above, the following general 
conclusions could be drawn: 

- the Data series are in general less volatile than the Voice series, as ex- 
pected. 

- it is quite plausible that the range α ∈ [0.5, 1] for the Data series will
do a good job in duplicating the manual forecasts; the Voice series certainly
requires careful analysis. For example, in Figure 2 we have chosen
the Kalman-filter parameters to reflect our assessment: in this case, an
expected decrease.

It should be obvious by now that we strongly advocate the use of range- 
forecasting, at least in situations where it is called for, namely when the 
forecast horizon is long enough to include potential changes which might 
have an impact on the series, and for noisy series with volatile trends. The 
“range” could be obtained by considering the most extreme forecasts from 
the selected subset. The important feature to remember here is that although 
forecasts were generated and selected in an (almost) automatic fashion, the 
final forecasts should always be subject to manual adjustments before they
are used. Examples of causes for these adjustments in the telephone industry
include marketing strategies, sales projections, and anticipated rate
changes. Furthermore, there might be different groups of users for the same
forecast(s); this would result in different adjustment methods reflecting different
needs and interests. Figure 3 illustrates the kind of forecasting strategy
we have in mind.

Table 2. Smoothed Kalman-Filter Forecasts

                  DAVEG
 α      Data Series    Voice Series

0.0        -2.40          -2.40
0.1        -1.40           1.60
0.2        -1.20           2.46
0.3        -0.83           3.30
0.5        -0.60           5.00
1           0.44           4.70*

Notes: The data-files from which the samples were drawn contain approximately
22,000 Voice series and 5,000 Data series.

*For this case, four (4) selected series were subjectively excluded from the
computations (forecasts differ from the most current 12-month average by
more than 100).
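
The idea of taking the “range” from the most extreme forecasts in the selected subset can be sketched in a few lines. The function name, the scenario labels and the forecast values below are hypothetical and serve only to illustrate the bookkeeping; they are not taken from the study.

```python
def forecast_range(selected_forecasts):
    """Return (low, high) for each forecast period from the retained forecasts.

    selected_forecasts maps a forecast label to a list of values, one per
    period; all lists must have the same length.
    """
    series = list(selected_forecasts.values())
    if not series:
        raise ValueError("at least one forecast is required")
    if len({len(s) for s in series}) != 1:
        raise ValueError("all forecasts must cover the same periods")
    return [(min(values), max(values)) for values in zip(*series)]


# Hypothetical retained forecasts for three future periods.
retained = {
    "scenario A": [10.0, 11.0, 12.5],
    "scenario B": [10.5, 10.8, 13.0],
    "scenario C": [9.8, 11.4, 12.0],
}
print(forecast_range(retained))
```

Any manual adjustment of the kind described above would then be applied to these bounds, or to the individual forecasts before the range is taken.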

In summary, we have suggested a semi-automatic approach to forecasting 
either when the series are volatile or when the forecast horizon is expected 
to include significant changes. A range of forecasts is produced through 
ranking and subset selection methods with input/output scenarios to assess
and validate the selected forecasts. 



Figure 3. Schema of semi-automatic forecasting system. (Diagram labels:
source 1, source 2, ..., source k'; Scenario 1, Scenario 2, ..., Scenario k;
automatic forecasting; Forecast 1, Forecast 2, ..., Forecast k; user 1,
user 2, ..., user k*.)



4. AN APPLICATION TO TELEPHONE SERVICES DEMAND

Like many businesses and public organizations, telephone companies rely
heavily on econometric models to provide forecasts for many policy analyses
and corporate decisions. The models are used in a variety of applications,
ranging from general planning activities, such as budgeting, network provisioning,
and marketing strategies, to providing support for tariff revisions
associated with rate applications. In most cases, state-of-the-art time series 
and econometric modeling techniques are used. However, there are certain 
unique features of the telecommunications environment which make these 
methods difficult to apply, such as technological advances, decisions by the 
regulatory commission, market changes to new services being offered and 
competition. All these imponderables need to be taken into account when 
constructing forecast scenarios; this is in addition to considering the usual 
economic/social factors such as strikes, recessions, and seasonality patterns. 
Each factor could have a major impact on service demand and revenue. The
construction of these scenarios is illustrated next. 

Figure 4 is a quarterly series representing network access service (on a 
log-scale) for a certain region during 1971-1984. To obtain forecasts for this 
series, an econometric model was developed using as the main explanatory 
variable the ratio of Gross Domestic Product over Population Size (15 years 
or older). Figure 5 is a graph of the logarithm of this series. Besides sea- 
sonality, two special events were also accounted for using dummy variables: 
the strike in 1979, and the economic slowdown in the first quarter of 1982. 
There are three main periods, identified by Kiss (1983) from the historical 
data, which correspond to three possible forecast scenarios: 

(i) 1971-1975: This was a period of fast growth due to technological 
improvement and high demand for telephone services. 

(ii) 1976-1980: There was a slowdown in demand which coincided with,
and was in part caused by, worldwide economic problems and some
political uncertainties in Canada (1976-1978); there seem to have
been some modest gains in the 1979-1980 period, due mainly to the
introduction of digital electronic switches (DMS).

(iii) 1981-1984: Except for the first quarter of 1982, when the recession
effect was unusually severe, there are some indications of a steady
growth in demand for that period due mainly to expansion of the
network and addition of new services.

It is interesting to note that the explanatory variable (Figure 5) also 
exhibits the same growth pattern as the telephone demand series, indicating 
a very strong link between general economic performance and telephone 
demand. While this link is hardly surprising, it helps develop and validate 
certain scenarios based on a modest to strong growth for this economic 
indicator (dotted lines on Figure 5); this in turn will be used to develop the 
forecast range for the telephone demand series. However, final forecasts are 
not presented here for confidentiality reasons. 

Finally, it must be emphasized that the above forecasts would almost 
certainly be revised to account for important regulatory decisions which 
were expected in 1985; these include rate increases and competition for long 
distance services. 



5. CONCLUSIONS AND EXTENSIONS 

In this paper, the problem of selecting a subset of “good” models using 
model selection criteria has been extended to include models developed for 
forecasting purposes. The concept of “range” for economic forecasting was 
also discussed and illustrated. 

As useful extensions to this approach, we are now investigating the fol- 
lowing two problems: 

(i) The use of AIC-type criteria to detect structural changes in time
series models (e.g., changes in the order of an AR model). 

(ii) The extension of the concept of range forecasting to independent 
observations; for example, estimation of the mean under different 
distributional assumptions.



SMOOTHNESS IN REGRESSION:
ASYMPTOTIC CONSIDERATIONS 

ABSTRACT 

Smoothness restrictions may be used in linear models, for instance, when 
the data set does not contain enough information to allow sufficiently accu- 
rate estimation of parameters. In this connection the problem of weighting 
the sample and non-sample information by a smoothing parameter arises. 
Suppose we want to choose the smoothing parameter in such a way that 
the mean squared prediction error (MSPE) of the model will be minimized. 
Starting from this assumption, an equation for the minimizer is derived, and 
its asymptotic solution is shown to be finite and unique. Next, it is proposed 
that common model selection criteria such as the AIC be generalized to be 
used for the estimation of the smoothing parameter. Asymptotically, several 
generalized criteria, called smoothing criteria in the paper, yield exactly the 
unique value which minimizes the MSPE. In fact, these are generalizations 
of the model selection criteria which are optimal according to the definition 
of Shibata (1981). Application of the above theoretical results to ridge
regression is discussed. The effect of autocorrelated errors on the estimation
of the smoothing parameter is also considered, and cases where autocorre- 
lation may have little influence on the coefficient estimates of the model are 
indicated. 



1. INTRODUCTION 



Smoothness restrictions may be used in linear models when the data set 
itself does not contain enough information to allow sufficiently accurate esti- 
mation of parameters. In that context, the problem of weighting the sample 
and non-sample information, i.e., choosing an optimal degree of smoothing,
arises. A similar problem appears in non-parametric regression although
its original motivation is usually different; smoothness restrictions are not 
necessarily a consequence of the paucity of data. Rather, they stem from 
the belief that the underlying “true” relationship between the dependent 
and the independent variable can be represented by a smooth curve; see, for 
example, Shiller (1984). The degree of smoothing is a compromise between 
an expression of this prior belief and the original variability of the data. 

A popular example of smoothness restrictions is polynomial distributed 
lag estimation. Shiller (1973) discussed the use of polynomial smoothness 
priors and a “rule of thumb” for determining the value of the smoothing 
parameter. Gersovitz and MacKinnon (1978) also applied a rule of thumb 
in their treatment of linear models whose regression coefficients are supposed 
to vary smoothly with the season. Other suggestions and more discussion on 
how to choose smoothing parameters in polynomial distributed lag models 
have been provided by Ullah and Raj (1979) and Thurman et al. (1984); see
also Judge et al. (1985, Chapter 9) and Trivedi (1984).

Golub et al. (1979; see also references therein) have addressed the problem
of obtaining the value of the smoothing parameter in non-parametric
regression. Their method is called generalized cross validation (GCV), and 
it consists essentially of finding the global minimum of a function of the 
smoothing parameter of the model. We shall return later to the GCV method 
which was recently applied by Engle et al. (1982) to the estimation of a
non-parametric relationship between weather and electricity demand. Shiller 
(1984) also provided an empirical example of smoothness restrictions in non- 
parametric regression, but the value of the smoothing parameter seemed to 
have been fixed in advance. 

Ridge regression constitutes a special case of smoothness priors in linear 
models. The smoothness assumption simply means that the coefficients of 
the linear model are supposed to be “small”. A wide variety of techniques
have been proposed for finding the smoothing parameter in ridge regression;
for discussion and examples, see Dempster et al. (1977),
Draper and Van Nostrand (1979) and Gibbons (1981). We shall discuss 
some of those techniques later on. Oman (1982) has studied a more general 
case where the estimates are shrunk towards a subspace rather than to zero. 
He considered the determination of the value of the smoothing parameter in 
that context. 

This paper considers a generalization of well-known model selection cri- 
teria (MSC) such as AIC or SBIC in such a way that they can be used 
for determining the value of the smoothing parameter of the model in any 
of the above cases. The idea is to generalize the number of parameters in 
the penalty function of the MSC so that the generalized number is a monotonically
decreasing function of the smoothing parameter. Properties of the
generalized MSC, which will be called smoothing criteria (SC), can be studied
using asymptotic concepts. The asymptotic minimizer of the mean squared
prediction error (MSPE) with respect to the smoothing parameter is unique.
Shibata (1981) has defined an optimality concept for MSC. It turns out that
the generalized versions of ordinary MSC optimal according to Shibata’s
definition have the property that their asymptotic minimizers coincide with 
that of the MSPE. 

Another problem investigated in this paper is the sensitivity of the SC 
to autocorrelated errors. The conclusion is that the MSPE of the model can 
be expected to be insensitive to autocorrelation if the bias in the smoothness 
restrictions is small and/or the sample is large. In these cases, the estimated 
value of the smoothing parameter which is affected by autocorrelation may 
vary substantially with little effect on the MSPE. 

This paper is organized as follows: The model and the smoothness re- 
strictions are introduced in Section 2. Criteria for measuring the perfor- 
mance of different SC in determining the value of the smoothing parameter 
and some examples are briefly mentioned in Section 3. Section 4 comprises 
the optimality results. Section 5 considers a special case of smoothness 
restrictions, namely, ridge regression. In Section 6, the effects of autocorre- 
lated errors on the minimization of MSPE are discussed. Section 7 points 
out that the small-sample properties of different smoothing criteria cannot 
be expected to be similar. 



2. PRELIMINARIES 
Consider a linear model 

\[ y_n = X_n \beta + \epsilon_n, \qquad \epsilon_n \sim N(0, \sigma^2 I_n), \qquad (2.1) \]

where $y_n$ is a stochastic $n \times 1$ vector of the dependent variable, $X_n$
is an $n \times p$ matrix of the finite values of the independent variables at
$t = 1, \ldots, n$, rank($X_n$) = p; $\beta$ is a $p \times 1$ vector of regression coefficients
and $\epsilon_n$ is a stochastic $n \times 1$ error vector. Furthermore, suppose the independent
variables are non-trending: $n^{-1} X_n' X_n = C_n > 0$ for $n > p$ and
$\lim_{n \to \infty} C_n = C > 0$.

In this paper we shall consider the estimator 

6fl(A) = + \R!R)-\X'^yn + XR'r), (2.2) 

where r is a fixed m X 1 vector and R a finite-valued m X p matrix of 
constraints, rank(i^) = m < p. Shiller (1973) derived (2.2) by assuming the 
existence of the following prior information about the coefficients of (2.1): 

R/3 ^ (2.3) 
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The parameter A in (2.3) is the ratio between the error variance of (2.1) 
and the prior variance of the variables of the prior distribution. Assuming 
that the prior is not degenerate, A is a finite constant and independent of 
n. Estimator (2.2) is the posterior mean of Estimator (2.2) can also be 
obtained by minimizing, with respect to 

q{p) = (y - XpY{y - xp) + As's, (2.4) 

where s = r — Rp. In (2.4), A is a Lagrange multiplier not given an explicit 
interpretation present in the Bayesian context. However, in finite samples 
A may be seen as controlling the trade-off between the sample information 
and the requirement that || s || be “small”. 
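The relation between (2.2) and (2.4) can be checked numerically. The following is a minimal sketch in plain Python, not the authors' computation: the data, the restriction $R = (0, 1)$, $r = 0$ and the helper names (`solve`, `gram`, `smoothness_estimator`, `criterion`) are all hypothetical choices made for illustration.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def gram(a, b):
    """Return a'b for matrices stored as lists of rows."""
    rows, cols = len(a[0]), len(b[0])
    return [[sum(a[t][i] * b[t][j] for t in range(len(a))) for j in range(cols)]
            for i in range(rows)]


def smoothness_estimator(x, y, r_mat, r_vec, lam):
    """b_R(lam) = (X'X + lam R'R)^{-1} (X'y + lam R'r), equation (2.2)."""
    p = len(x[0])
    xtx = gram(x, x)
    rtr = gram(r_mat, r_mat)
    xty = [sum(x[t][i] * y[t] for t in range(len(x))) for i in range(p)]
    rt_r = [sum(r_mat[k][i] * r_vec[k] for k in range(len(r_mat))) for i in range(p)]
    lhs = [[xtx[i][j] + lam * rtr[i][j] for j in range(p)] for i in range(p)]
    rhs = [xty[i] + lam * rt_r[i] for i in range(p)]
    return solve(lhs, rhs)


def criterion(x, y, r_mat, r_vec, lam, beta):
    """q(beta) = (y - X beta)'(y - X beta) + lam s's, s = r - R beta, equation (2.4)."""
    p = len(beta)
    resid = [y[t] - sum(x[t][i] * beta[i] for i in range(p)) for t in range(len(y))]
    s = [r_vec[k] - sum(r_mat[k][i] * beta[i] for i in range(p))
         for k in range(len(r_vec))]
    return sum(e * e for e in resid) + lam * sum(v * v for v in s)


# Hypothetical data: intercept and slope, with one restriction pulling the slope to 0.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0.1, 0.9, 2.2, 2.8]
R = [[0.0, 1.0]]
r = [0.0]
b_ols = smoothness_estimator(X, y, R, r, 0.0)     # lam = 0 gives least squares
b_smooth = smoothness_estimator(X, y, R, r, 5.0)  # lam = 5 shrinks the slope
```

With $\lambda = 0$ the sketch returns the least squares estimate, and increasing $\lambda$ moves the restricted coefficient toward the value implied by $R\beta = r$.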

In polynomial distributed lag models $r = 0$. In ridge regression $r = 0$
and $R = I$ so that $p = m$. Non-parametric regression also presupposes that
$r = 0$ and in addition that $X_n'X_n$ is a diagonal matrix with positive diagonal
elements. Of course, if the model also contains ordinary linear regressors,
then the latter assertion is no longer valid.

Note that (2.2) is not a mixed estimator of Theil and Goldberger (1961).
In their estimator, $r$ is stochastic, and the stochastic properties of the mixed
estimator are not the same as those of (2.2); see, for example, Judge et al.
(1985, Chapters 3 and 22) and Teräsvirta (1981).

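Finally, we shall need the residual sum of squares,

$n\hat{\sigma}^2(\lambda) = \{y_n - X_nb_R(\lambda)\}'\{y_n - X_nb_R(\lambda)\}$

$= n\hat{\sigma}^2 + \tilde{s}'S_n(\lambda)RU_nR'S_n(\lambda)\tilde{s},$ (2.5)

where $\tilde{s} = r - Rb$, $b = U_nX_n'y_n$, $U_n = (X_n'X_n)^{-1}$,
$S_n(\lambda) = (\lambda^{-1}I + RU_nR')^{-1}$ and $n\hat{\sigma}^2 = y_n'(I - X_nU_nX_n')y_n$.
In the following, $S_n(\lambda) = S_n$ for brevity.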

3. OPTIMAL SMOOTHING 

In the above interpretation of (2.2), $\lambda$ is a fixed constant which may itself
have a natural interpretation. We shall, however, concentrate on situations
where $\lambda$ is not known but is determined from the data. This gives rise to
the problem of “optimal” selection of $\lambda$; there is a trade-off between the
weight of the sample and non-sample information, and a criterion is needed
to define optimal weighting. A popular way of defining optimal trade-off
consists of applying a quadratic risk function

$q_n(\lambda) = E\{b_R(\lambda) - \beta\}'A\{b_R(\lambda) - \beta\}$ (3.1)

with $A > 0$ and choosing the minimizer of (3.1) to be the optimal $\lambda$. Note,
however, that (3.1) may have several local minima.

In ridge regression, (3.1) with $A = I$ has been applied to finding the
ridge parameter. If the aim of model building is forecasting, a possible
choice is $A = X_h'X_h$, i.e., (3.1) becomes the MSPE. The moment matrix
$X_n'X_n$ is often used in theoretical work instead of $X_h'X_h$, where $X_h$ consists of
observations not included in the actual sample. We shall choose $A = X_n'X_n$
and consider minimizing (3.1) under that assumption. The asymptotic result
of this consideration is formulated in the following theorem.

Theorem 3.1. Consider the linear model (2.1) and its smoothness estimator
(2.2). Assume that $s \ne 0$. Asymptotically, the minimizer of the MSPE is

$\lambda_0 = \sigma^2\,\mathrm{tr}(RC^{-1}R')/(s'RC^{-1}R's).$ (3.2)

The proof is contained in Appendix 1.

Remark 1. Of course, $\lim_{n\to\infty} q_n(\lambda) = \sigma^2p$ for all finite $\lambda$. However,
Theorem 3.1 illuminates the situation in large samples and therefore is useful.

Remark 2. If $s = 0$, the problem is different. In that case $b_R(\lambda)$ is unbiased
for any $\lambda > 0$. It follows from the Gauss-Markov theorem that (3.1) is
minimized by the restricted least squares estimator $b_R(\infty) = b_R$ for all
$n > p$, and $q_n(\infty) = \sigma^2(p - m)$.

It was mentioned in the introduction that several methods of estimating
$\lambda$ have been proposed in the literature. From the viewpoint of this paper, a
particularly interesting one is the GCV method. Wahba (1975), Craven and
Wahba (1979) and Golub et al. (1979) have suggested selecting the $\lambda$ which is
the minimizer of

$\mathrm{GCV}(\lambda) = \hat{\sigma}^2(\lambda)\{1 - k_n(\lambda)/n\}^{-2},$

where

$k_n(\lambda) = p - \mathrm{tr}(S_nRU_nR') = p - m + \mathrm{tr}(T_n)$ (3.3)

with

$T_n = (I + \lambda RU_nR')^{-1} = \lambda^{-1}S_n.$

Expression (3.3) is called the generalized number of parameters of a
restricted linear model. It is a monotonically decreasing function of $\lambda$;
$k_n(0) = p$, and $k_n(\lambda) \downarrow (p - m)$ as $\lambda \to \infty$. In the latter case,
$b_R(\lambda) \to b_R$ in probability.
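The behaviour of (3.3) is easy to trace once $RU_nR'$ is diagonalized: if $\mu_1, \ldots, \mu_m$ are its eigenvalues, then $\mathrm{tr}(T_n) = \sum_i (1 + \lambda\mu_i)^{-1}$. A minimal sketch in Python follows; the eigenvalues and the function name `generalized_parameters` are hypothetical and chosen only for illustration.

```python
def generalized_parameters(p, eigenvalues, lam):
    """k_n(lam) = p - m + tr(T_n), with T_n = (I + lam R U_n R')^{-1}.

    `eigenvalues` holds the m (positive) eigenvalues of R U_n R', so that
    tr(T_n) is the sum of 1 / (1 + lam * mu_i).
    """
    m = len(eigenvalues)
    return p - m + sum(1.0 / (1.0 + lam * mu) for mu in eigenvalues)


# Hypothetical example with p = 5 coefficients and m = 3 restrictions.
p = 5
mu = [0.5, 0.1, 0.02]
lams = (0.0, 1.0, 10.0, 100.0, 1e9)
values = [generalized_parameters(p, mu, lam) for lam in lams]
```

The list `values` starts at $p$ for $\lambda = 0$ and decreases toward $p - m$, which is the monotonicity the generalized criteria rely on.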

The GCV method was originally derived for minimizing the MSPE. The
concept of the generalized number of parameters may be used for a formal
generalization, devoid of any further theoretical considerations, of existing
MSC. The number of parameters in their penalty function is simply replaced
by its generalization (3.3). Since this MSC generalization is only based on
a principle of analogy, optimality properties of these generalized criteria,
henceforth called smoothing criteria (SC), have to be investigated. For that
purpose, we introduce the following definition.

Definition 3.1. An SC for determining $\lambda$ in $b_R(\lambda)$ is called optimal if,
asymptotically, given $s \ne 0$, its minimizer equals (3.2).

Teräsvirta and Mellin (1986) defined three basic types of MSC. Many
well-known criteria can be shown to belong to one of these three categories,
although there are important exceptions. We follow those authors by defining

$\mathrm{SC1}(\lambda) = \ln\hat{\sigma}^2(\lambda) + k_n(\lambda)n^{-1}f(n, 0),$ (3.4)

$\mathrm{SC2}(\lambda) = \hat{\sigma}^2(\lambda) + \tilde{\sigma}^2k_n(\lambda)n^{-1}f(n, p),$ (3.5)

and

$\mathrm{SC3}(\lambda) = \hat{\sigma}^2(\lambda) + \hat{\sigma}^2(\lambda)k_n(\lambda)n^{-1}f(n, k_n(\lambda)),$ (3.6)

where $\tilde{\sigma}^2 = (n - p)^{-1}(y_n - X_nb)'(y_n - X_nb)$, $f$ is a positive function of its
arguments and $n^{-1}f(n, \cdot) \to 0$ as $n \to \infty$. Several SC's which are generalizations
of MSC fitting into the above classifications are listed in Table 1. Note that
the GCV method is simply an SC of type 3. The functions $\hat{\sigma}^2(\lambda)$ are at least
twice continuously differentiable functions of $\lambda$. The asymptotic optimality
properties of the minimizers of (3.4)-(3.6) are characterized by the following
theorem.

Theorem 3.2. All SC of types 1-3 with $\lim_{n\to\infty} f(n, \cdot) = 2$ are optimal.

The proof is given in Appendix 1.
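That GCV and several other familiar criteria are type 3 SC can be verified by direct arithmetic: inserting the penalty functions $f$ of Table 1 into (3.6) must reproduce the usual closed forms. The sketch below does this in Python for URV, FPE, GCV and T; the numerical inputs and the dictionary names are hypothetical and serve only as a check of the algebra.

```python
def sc3(sigma2_lam, k, n, f):
    """Type 3 criterion (3.6): sigma^2(lam) * {1 + k_n(lam) f(n, k_n(lam)) / n}."""
    return sigma2_lam * (1.0 + k * f(n, k) / n)


# Penalty functions f(n, k) as listed in Table 1 for four type 3 criteria.
penalties = {
    "URV": lambda n, k: 1.0 / (1.0 - k / n),
    "FPE": lambda n, k: 2.0 / (1.0 - k / n),
    "GCV": lambda n, k: (2.0 - k / n) / (1.0 - k / n) ** 2,
    "T": lambda n, k: 2.0 / (1.0 - 2.0 * k / n),
}

# Familiar closed forms of the same criteria, written with k in place of p.
closed_forms = {
    "URV": lambda s2, n, k: s2 / (1.0 - k / n),
    "FPE": lambda s2, n, k: s2 * (1.0 + k / n) / (1.0 - k / n),
    "GCV": lambda s2, n, k: s2 / (1.0 - k / n) ** 2,
    "T": lambda s2, n, k: s2 / (1.0 - 2.0 * k / n),
}

# Hypothetical values of sigma^2(lam), k_n(lam) and n.
s2, k, n = 1.7, 3.4, 50
checks = {name: (sc3(s2, k, n, penalties[name]), closed_forms[name](s2, n, k))
          for name in penalties}
```

Each pair in `checks` agrees up to rounding error, which confirms, for example, that $f = \{2 - k_n(\lambda)/n\}\{1 - k_n(\lambda)/n\}^{-2}$ turns (3.6) into the GCV function.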

From Table 1 it is seen that, along with GCV, there is a host of
asymptotically equivalent criteria with the optimality property of Definition 3.1.
On the other hand, Shibata (1981) defined an asymptotic optimality criterion
for ordinary MSC. The problem was to select a finite subset of regressors
when the number of regressors was either infinite or growing to infinity
with the number of observations so that $p/n \to 0$. An MSC is optimal if
it asymptotically minimizes the MSPE when predicting an infinite number
of steps ahead. Shibata (1981) showed that an MSC of type 3 called S (see
Table 1) is optimal. Consequently, all criteria asymptotically equivalent
to S have the same optimality property; these criteria are those for which
$\lim_{n\to\infty} f(n, \cdot) = 2$.

A well-known MSC which does not fit into the three categories of
Teräsvirta and Mellin (1986) is BIC (Sawa, 1978). Nevertheless, it can in
principle be generalized in the same way as the above criteria. In Appendix
2, it is seen that GBIC is optimal in the sense of Definition 3.1.

Table 1. Smoothness Criteria Generalized from MSC
and the Corresponding Limits of $f(n, \cdot)$.

SC                                             Type  $f(n, \cdot)$                                                          $\lim_{n\to\infty} f(n, \cdot)$
AIC (Akaike, 1974)                             1     $2$                                                                    $2$
SBIC (Schwarz, 1978)                           1     $\ln n$                                                                $\infty$
HQ (Hannan and Quinn, 1979)                    1     $2\ln\ln n$                                                            $\infty$
$C_p$ (Mallows, 1973)                          2     $2$                                                                    $2$
BEC (Geweke and Meese, 1981)                   2     $(1 - p/n)^{-1}\ln n$                                                  $\infty$
URV (Unbiased Residual Variance; Theil, 1961)  3     $\{1 - k_n(\lambda)/n\}^{-1}$                                          $1$
PC (Amemiya, 1980) or FPE (Akaike, 1969)       3     $2\{1 - k_n(\lambda)/n\}^{-1}$                                         $2$
GCV (Golub et al., 1979)                       3     $\{2 - k_n(\lambda)/n\}\{1 - k_n(\lambda)/n\}^{-2}$                    $2$
S (Shibata, 1981)                              3     $2$                                                                    $2$
$S_p$ (Hocking, 1976)                          3     $[2 - \{k_n(\lambda) - 1\}/(n - 1)]\{1 - k_n(\lambda)/n\}^{-1}\{1 - k_n(\lambda)/(n - 1)\}^{-1}$  $2$
T (Rice, 1984)                                 3     $2\{1 - 2k_n(\lambda)/n\}^{-1}$                                        $2$
BIC (Sawa, 1978)                               ·     ·                                                                      $2$



4. OTHER ALTERNATIVES FOR CHOOSING
THE SMOOTHING PARAMETER

The previous section discussed smoothing criteria, but other consistent
methods of estimating $\lambda$ are available. One way of approaching the problem
is to take the minimizing equation (A1.5) and operationalize it. Since

$E(\tilde{s}'P_n\tilde{s}) = s'P_ns + \sigma^2n^{-1}\mathrm{tr}\{T_n^3(RC_n^{-1}R')^2\},$

one may replace both sides of (A1.5) by their unbiased estimators to obtain

$\lambda[\tilde{s}'P_n\tilde{s} - \hat{\sigma}^2n^{-1}\mathrm{tr}\{T_n^3(RC_n^{-1}R')^2\}] = \hat{\sigma}^2\,\mathrm{tr}(P_n).$ (4.1)

Solving (4.1) may give negative values of $\lambda$ so that we may prefer a positive
part variant

$\lambda[\tilde{s}'T_n^3RU_nR'\tilde{s} - \hat{\sigma}^2\,\mathrm{tr}\{T_n^3(RU_nR')^2\}]_+ = \hat{\sigma}^2\,\mathrm{tr}(T_n^3RU_nR'),$ (4.2)

where $[a]_+ = a$ if $a > 0$ and $[a]_+ = 0$ otherwise. If the l.h.s. of (4.2) equals
zero then a restricted least squares estimator ($\lambda = \infty$) is selected.

A computationally simple alternative is to operationalize the asymptotic
solution (3.2). A unique consistent positive part estimator corresponding to
(4.2) is

$\hat{\lambda}_+ = \hat{\sigma}^2\,\mathrm{tr}(RU_nR')/[\tilde{s}'RU_nR'\tilde{s} - \hat{\sigma}^2\,\mathrm{tr}\{(RU_nR')^2\}]_+.$
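The positive part estimator corresponding to (4.2) involves only two traces and one quadratic form, so it can be computed without any matrix inversion once $RU_nR'$ is available. A minimal sketch in Python, assuming $RU_nR'$ is supplied as a small symmetric matrix; the function name `positive_part_lambda` and the numbers are hypothetical.

```python
import math


def positive_part_lambda(m_mat, s_tilde, sigma2_hat):
    """sigma2_hat * tr(M) / [s'Ms - sigma2_hat * tr(M^2)]_+ with M = R U_n R'.

    Returns math.inf when the positive part is zero, which corresponds to
    selecting the restricted least squares estimator (lambda = infinity).
    """
    m = len(m_mat)
    trace = sum(m_mat[i][i] for i in range(m))
    trace_sq = sum(m_mat[i][j] * m_mat[j][i] for i in range(m) for j in range(m))
    quad = sum(s_tilde[i] * m_mat[i][j] * s_tilde[j]
               for i in range(m) for j in range(m))
    bracket = max(quad - sigma2_hat * trace_sq, 0.0)
    if bracket == 0.0:
        return math.inf
    return sigma2_hat * trace / bracket


# Hypothetical inputs: M = R U_n R' (2 x 2), s_tilde = r - R b, sigma2_hat.
M = [[0.1, 0.0], [0.0, 0.2]]
lam_clear_violation = positive_part_lambda(M, [1.0, 1.0], 1.0)
lam_no_evidence = positive_part_lambda(M, [0.1, 0.1], 1.0)
```

In the first call the restrictions are visibly violated by the data and a finite $\lambda$ results; in the second the bias-corrected quadratic form is negative, so the positive part is zero and $\lambda = \infty$ is chosen.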

Another means of determining $\lambda$ is to follow a Bayesian line of argument.
Although $b_R(\lambda)$ has a Bayesian interpretation, (4.2) is not an empirical Bayes
solution. Assuming $R\beta \sim N(r, \sigma^2\lambda^{-1}I)$, where $\lambda$ is a constant, the marginal
distribution of $\tilde{s} = r - Rb$ is $N(0, \sigma^2\lambda^{-1}T_n^{-1})$. Thus
$\sigma^{-2}\lambda\tilde{s}'T_n\tilde{s} \sim \chi^2(m)$. Operationalize the $\chi^2$ statistic by substituting
$\hat{\sigma}^2$ for $\sigma^2$ and set

$\hat{\sigma}^{-2}\lambda\tilde{s}'T_n\tilde{s} = m,$ (4.3)

where $m$ is the expectation of $\sigma^{-2}\lambda\tilde{s}'T_n\tilde{s}$. Solving (4.3) for $\lambda$ yields an
empirical Bayes estimate of this parameter. Letting $n \to \infty$, the l.h.s. of
(4.3) converges in probability to $\sigma^{-2}\lambda s's$. The asymptotic solution is

$\lambda = \sigma^2m/(s's),$

which coincides with (3.2) if $RC^{-1}R' = I$.

Thurman et al. (1984) suggested the following procedure. The necessary
and sufficient condition for

$E\{(b - \beta)(b - \beta)'\} - E[\{b_R(\lambda) - \beta\}\{b_R(\lambda) - \beta\}'] \ge 0$

is

$\sigma^{-2}s'(2\lambda^{-1}I + RU_nR')^{-1}s \le 1.$ (4.4)

Replace the inequality in (4.4) by an equality and substitute least squares
sample values for the parameters. The value of the smoothing parameter is
the solution of

$\hat{\sigma}^{-2}\tilde{s}'(2\lambda^{-1}I + RU_nR')^{-1}\tilde{s} = 1.$ (4.5)

The corresponding asymptotic solution is

$\lambda = 2\sigma^2/(s's).$

For a discussion of the rationale of (4.5), see Swamy and Mehta (1983).

5. RIDGE REGRESSION 

The above results are directly applicable to ridge regression. As
mentioned above, the usual ridge estimator is (2.2) with $R = I$ and $r = 0$ so
that $s = -\beta$. Equation (A1.5) for the minimizer becomes

$\lambda\beta'T_n^3C_n^{-1}\beta = \sigma^2\,\mathrm{tr}(T_n^3C_n^{-1}),$ (5.1)

where $T_n = \{I + (\lambda/n)C_n^{-1}\}^{-1}$. This can be operationalized as above. On
the other hand, Dempster et al. (1977) have discussed minimizing (3.1)
w.r.t. $\lambda$ when $A = I$. The minimizing equation is, in our notation,

$\lambda\beta'T_n^3C_n^{-2}\beta = \sigma^2\,\mathrm{tr}(T_n^3C_n^{-2}).$ (5.2)

Dempster et al. (1977) suggested operationalizing (5.2) by substituting $b$
for $\beta$ and $\hat{\sigma}^2$ for $\sigma^2$ and called the resulting ridge estimator SRIDG. If the
l.h.s. of (5.2) is estimated without bias, an operational minimizing equation
becomes

$\lambda[b'T_n^3U_n^2b - \hat{\sigma}^2\,\mathrm{tr}(T_n^3U_n^3)]_+ = \hat{\sigma}^2\,\mathrm{tr}(T_n^3U_n^2).$

For $[\,\cdot\,]_+ = 0$, the estimate of $\beta$ equals zero.

From (4.3) we have

$\lambda\hat{\sigma}^{-2}b'T_nb = p.$ (5.3)

The corresponding ridge estimator was called RIDGM by Dempster et al.
(1977). Letting $n \to \infty$ in (5.2), we obtain a unique solution for $\lambda$:

$\lambda_0 = \sigma^2\,\mathrm{tr}(C^{-2})/(\beta'C^{-2}\beta).$ (5.4)

The corresponding asymptotic solution for $\lambda$ from (5.3) is

$\lambda_0 = \sigma^2p/(\beta'\beta).$ (5.5)

For $X_n'X_n = nI_p$, $n > p$, (5.1) and (5.2) equal (5.5). Solution (5.5) is also
an asymptotic expression of the ridge parameter $\lambda = \hat{\sigma}^2p/(b'b)$ of the HKB
estimator; see, for example, Hoerl et al. (1975) and Judge et al. (1985,
Chapter 22). In the orthogonal case, (5.3) can be written as

$(1 + p/n)\lambda = \hat{\sigma}^2p/(b'b).$

Thus SRIDG, HKB and RIDGM are close to each other if $X_n'X_n$ is nearly
orthogonal. HKB and RIDGM can also be expected to yield similar estimates
if the sample size is large. In the extensive simulation experiment of Gibbons
(1981), RIDGM and HKB together with GCV were the best estimators. The
MSE ($A = I$) was used as the principal measure of performance.

SRIDG, HKB and RIDGM lead to $\lambda = O_p(1)$. This is not true for all
rules suggested in the literature. Lawless and Wang (1976) proposed the
minimizer (LW)

$\lambda = \hat{\sigma}^2p/(b'X_n'X_nb) = \hat{\sigma}^2p/(nb'C_nb).$ (5.6)

It is seen to be $O_p(n^{-1})$. This feature of the LW estimator has not been a
cause of concern in Monte Carlo experiments. One reason is that in
simulation studies, the sample size has traditionally been fixed. Moreover, the
regressors have usually been standardized so that $X_n'X_n$ is in correlation
form. This amounts to having $n = 1$ in (5.6). However, in applications with
non-standardized data, the difference in order is likely to have considerable
influence upon the results.
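The difference in order between the HKB rule and (5.6) can be made concrete with a small calculation. The sketch below, in Python, evaluates both rules for a fixed coefficient estimate and an orthogonal design $X_n'X_n = nI_p$; the numbers and the function names `hkb` and `lawless_wang` are hypothetical.

```python
def hkb(b, sigma2_hat):
    """HKB ridge parameter: sigma2_hat * p / (b'b)."""
    return sigma2_hat * len(b) / sum(v * v for v in b)


def lawless_wang(b, xtx, sigma2_hat):
    """LW ridge parameter (5.6): sigma2_hat * p / (b' X'X b)."""
    p = len(b)
    quad = sum(b[i] * xtx[i][j] * b[j] for i in range(p) for j in range(p))
    return sigma2_hat * p / quad


# Hypothetical least squares results, held fixed while n grows.
b = [1.0, -2.0]
sigma2_hat = 0.5
results = {}
for n in (10, 100, 1000):
    xtx = [[float(n), 0.0], [0.0, float(n)]]  # orthogonal design, X'X = n I
    results[n] = (hkb(b, sigma2_hat), lawless_wang(b, xtx, sigma2_hat))
```

In this orthogonal setting the LW value equals the HKB value divided by $n$, so the two rules agree only when the moment matrix is in correlation form.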



6. AUTOCORRELATED ERRORS 

In their study of the relation between electricity demand and weather,
Engle et al. (1982) estimated a relationship between those two variables
using non-parametric regression and GCV. They assumed that the errors of
the model were an AR(1) process. The estimation results from minimizing
the GCV were surprisingly insensitive to the values of the autocorrelation
coefficient. In this section, possible reasons for this phenomenon will be
considered. However, the starting-point will not be in the estimation of
parameters with autocorrelated errors but the usual OLS estimation ignoring
the autocorrelation.

Assume that in (2.1), $\epsilon_n \sim N(0, \sigma^2\Sigma_n)$, where $\Sigma_n$ is an $n \times n$ positive
definite matrix standardized to have ones on the main diagonal. Set
$n^{-1}X_n'\Sigma_nX_n = D_n$. Suppose $D_n > 0$ for $n > p$ and that $\lim_{n\to\infty} D_n = D > 0$.
Use $b_R(\lambda)$ as the estimator of $\beta$ as in the case of independent errors.
Then it turns out that, asymptotically, the value of $\lambda$ that minimizes the
MSPE is

$\tilde{\lambda}_0 = \sigma^2\,\mathrm{tr}(RC^{-1}DC^{-1}R')/(s'RC^{-1}R's)$

$= \lambda_0 + \sigma^2\,\mathrm{tr}(RC^{-1}FC^{-1}R')/(s'RC^{-1}R's),$ (6.1)

where $F = \lim_{n\to\infty}(D_n - C_n) = D - C$. The modifications in
the proof of Theorem 3.1 needed to obtain (6.1) are in Appendix 3. The
second term on the r.h.s. of (6.1) can be either positive or negative. The
bias component of $q_n(\lambda)$ is not affected by the autocorrelation so that the
denominator of (6.1) is the same as that of (3.2).

Expression (6.1) is only an asymptotic result, but its simplicity is an
advantage when the effect of autocorrelated errors is considered. There are
at least two situations where the MSPE is insensitive to autocorrelation.
First, suppose $s \approx 0$ so that $\lambda_0$ is large. Then the minimum is usually flat
and even relatively large shifts in $\lambda$ have little influence on the MSPE or the
coefficient estimates of the model. A large sample has the same effect: for a
wide range of values of $\lambda$, the coefficient estimates will not change much.

It is possible to find upper and lower bounds for (6.1). Using a result of
Rao (1973, p. 62), we have

(6.2)

where $\mu_1 \ge \cdots \ge \mu_n$ are the eigenvalues of $\Sigma_n$, $\mathrm{tr}(\Sigma_n) = n$. We may
obtain upper and lower bounds for the r.h.s. of (A3.3) by (6.2). If one takes
those bounds and lets $n \to \infty$, one obtains:

$\mu_L\lambda_0 \le \tilde{\lambda}_0 \le \mu_U\lambda_0,$ (6.3)

where $\mu_L = \lim_{n\to\infty}\mu_n$ and $\mu_U = \lim_{n\to\infty}\mu_1$. When the error process
of the model is stationary with an absolutely summable covariance function,
$\mu_L$ and $\mu_U$ are the minimum and the maximum values, respectively, of the
spectral density of the process, cf. Fuller (1976, section 4.2).
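For AR(1) errors the bounds in (6.3) have a simple closed form. The sketch below, in Python, assumes that the spectral density is normalized as the Fourier transform of the autocorrelations $\rho^{|k|}$, i.e. $f(\omega) = (1 - \rho^2)/(1 - 2\rho\cos\omega + \rho^2)$, which matches the standardization of $\Sigma_n$ to unit diagonal; this normalization, the function names and the numerical values are assumptions made for illustration.

```python
import math


def ar1_density(rho, omega):
    """Fourier transform of the AR(1) autocorrelations rho**|k| at frequency omega."""
    return (1.0 - rho ** 2) / (1.0 - 2.0 * rho * math.cos(omega) + rho ** 2)


def ar1_spectral_bounds(rho):
    """Minimum and maximum of ar1_density over [0, pi] in closed form."""
    a = abs(rho)
    return (1.0 - a) / (1.0 + a), (1.0 + a) / (1.0 - a)


def lambda_bounds(lam0, rho):
    """Bounds (6.3) on the MSPE-minimizing lambda under AR(1) errors."""
    mu_l, mu_u = ar1_spectral_bounds(rho)
    return mu_l * lam0, mu_u * lam0


# Hypothetical case: autocorrelation coefficient 0.5 and lambda_0 = 4.
rho = 0.5
grid = [ar1_density(rho, math.pi * i / 1000) for i in range(1001)]
bounds = lambda_bounds(4.0, rho)
```

With $\rho = 0.5$ the density ranges from 1/3 to 3, so the minimizer under autocorrelation lies between $\lambda_0/3$ and $3\lambda_0$; whether that interval matters depends on how flat the MSPE is over it.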

The preceding discussion can easily be modified to apply when the true
model has $\epsilon \sim N(0, \sigma^2\Sigma)$, whereas the GLS estimator of $\beta$ is based on the
assumption $\epsilon \sim N(0, \sigma^2\Sigma_0)$, $\Sigma \ne \Sigma_0$. When the MSPE is flat between the
lower and upper bounds (6.3), the autocorrelation does not have noteworthy
influence upon the prediction accuracy of the model. In the weather and
electricity example of Engle et al. (1982) the values of $\lambda$ are relatively high
throughout. This may be concluded from the actual generalized number
of parameters, which is substantially closer to $p - m$ than to $p$ in all their
estimated models. As already mentioned, the minimum of the MSPE can
be expected to be fairly flat if it occurs at high values of $\lambda$.

This does not, as such, explain the fact that the coefficient estimates
in the application of Engle et al. (1982) are insensitive to autocorrelation.
Large fluctuations in the estimated coefficients are possible although the
MSPE changes little. However, in the case of non-parametric regression
$X_n'X_n$ is diagonal. In an extreme case, if the observations of the independent
variable are fairly uniformly distributed into groups ($j$ observations in each
group) defined by the model builder, then $X_n'X_n = jI_p$, $pj = n$. Even then
it is still possible, but less likely, that small changes in MSPE are associated
with large changes in estimated coefficients. Thus a possible explanation for
the insensitivity noticed by Engle et al. (1982) is a flat minimum of MSPE. It
may occur together with a large $\lambda$ when a minimizer of types (3.4)-(3.6) is
applied.



7. SMOOTHING IN SMALL SAMPLES 



It is seen from Table 1 that many of the smoothness criteria generalized
from the MSC are optimal. However, their small sample properties may
differ widely and not necessarily uniformly. Rice (1984) has underlined this
point when discussing a related topic, the choice of bandwidth for
non-parametric regression. In that paper, various model selection criteria were
appropriately modified to produce bandwidth estimates. It was shown that
some modified criteria are optimal in the sense that the minimizers converge
in probability to the MSPE minimizer as the number of observations goes
to infinity.

Rice (1984) argued that many criteria in fact undersmooth in small
samples. He therefore suggested another criterion (T) which penalizes
undersmoothing more heavily than the other criteria he considered. Its SC
modification appears in Table 1. In a Monte Carlo experiment T did perform best,
while some other criteria, most notably AIC, FPE (PC) and S, frequently
undersmoothed and performed poorly. Those findings clearly indicate that
a study of smoothing in regression is not complete without consideration
of the small sample properties of the various smoothing criteria and other
estimation methods discussed in this paper. At the moment we do not yet
have any small sample results available, but work on the problem is under
way.



APPENDIX 1: PROOFS OF THEOREMS 3.1 AND 3.2

Proof of Theorem 3.1. Consider model (2.1), estimator (2.2) and its
MSPE. Write MSPE as follows:

$q_n(\lambda) = \mathrm{tr}(X_n'X_nE[\{b_R(\lambda) - \beta\}\{b_R(\lambda) - \beta\}']).$ (A1.1)

We have

$b_R(\lambda) - \beta = (I - U_nR'S_nR)U_nX_n'\epsilon_n + U_nR'S_ns,$

where $U_n = (X_n'X_n)^{-1}$ and $s = r - R\beta$. The expectation in (A1.1) can be
written as follows:

$E[\{b_R(\lambda) - \beta\}\{b_R(\lambda) - \beta\}']$

$= \sigma^2U_n(I - R'S_nRU_n)^2 + U_nR'S_nss'S_nRU_n$

$= \sigma^2n^{-1}C_n^{-1}(I - n^{-1}R'S_nRC_n^{-1})^2 + n^{-2}C_n^{-1}R'S_nss'S_nRC_n^{-1},$ (A1.2)

where $C_n = n^{-1}X_n'X_n$ and $S_n = (\lambda^{-1}I + n^{-1}RC_n^{-1}R')^{-1}$. Thus

$q_n(\lambda) = \sigma^2\,\mathrm{tr}(I - n^{-1}R'S_nRC_n^{-1})^2 + n^{-1}s'S_n^2RC_n^{-1}R's$

$= \sigma^2\,\mathrm{tr}\{I - (\lambda/n)R'T_nRC_n^{-1}\}^2 + (\lambda^2/n)s'T_n^2RC_n^{-1}R's,$ (A1.3)

where $T_n = \lambda^{-1}S_n = \{I + (\lambda/n)RC_n^{-1}R'\}^{-1}$, because $S_n$ and $RC_n^{-1}R'$
commute. The first term on the r.h.s. of (A1.3) is $O(1)$. Only if it is
assumed that the minimizer $\lambda = O(n^{\delta})$, $\delta \le 1/2$, does the bias term in
(A1.3) remain finite as $n \to \infty$. Thus, the case of $\delta > 1/2$ is not considered
because it does not lead to a minimum of (A1.1).

If one differentiates (A1.3) w.r.t. $\lambda$, one obtains:

$q_n'(\lambda) = -2\sigma^2n^{-1}\mathrm{tr}\{(I - n^{-1}R'S_nRC_n^{-1})R'(\lambda^{-1}S_n)^2RC_n^{-1}\} + 2\lambda^{-2}n^{-1}s'S_n^3RC_n^{-1}R's$

$= -2\sigma^2n^{-1}\mathrm{tr}(T_n^3RC_n^{-1}R') + 2(\lambda/n)s'T_n^3RC_n^{-1}R's.$ (A1.4)

Note that $q_n'(0) < 0$ for any finite $n > p$. If one sets (A1.4) equal to zero,
one obtains:

$\lambda s'P_ns = \sigma^2\,\mathrm{tr}(P_n),$ (A1.5)

where $P_n = T_n^3RC_n^{-1}R'$. For $\delta < 1/2$, $T_n \to I$ and $q_n(\lambda) \to \sigma^2p$ as $n \to \infty$.
Thus, a unique asymptotic solution for (A1.5) exists and is given by:

$\lambda = \sigma^2\,\mathrm{tr}(RC^{-1}R')/(s'RC^{-1}R's).$ (A1.6)

The r.h.s. of (A1.6) is finite when $s \ne 0$, i.e., $\delta = 0$.
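The argument can be illustrated in the simplest case $p = m = 1$, $R = 1$, where all matrices are scalars and (A1.5) reduces to $\lambda s^2 = \sigma^2$ for every $n$, so the finite-sample minimizer of (A1.3) already equals (A1.6). The sketch below, in Python, checks this by a grid search; the parameter values and function names are hypothetical.

```python
def mspe_scalar(lam, n, c, sigma2, s):
    """q_n(lam) of (A1.3) for p = m = 1, R = 1 and C_n = c (all scalars)."""
    t = 1.0 / (1.0 + lam / (n * c))                 # T_n
    variance = sigma2 * (1.0 - (lam / n) * t / c) ** 2
    bias = (lam ** 2 / n) * s * s * t ** 2 / c
    return variance + bias


def lambda_zero(c, sigma2, s):
    """Asymptotic minimizer (A1.6): sigma^2 tr(R C^{-1} R') / (s' R C^{-1} R' s)."""
    return sigma2 * (1.0 / c) / (s * (1.0 / c) * s)


# Hypothetical values: n = 20, c = 2, sigma^2 = 1 and s = 0.5.
n, c, sigma2, s = 20, 2.0, 1.0, 0.5
grid = [i / 1000.0 for i in range(1, 10001)]        # lambda from 0.001 to 10
best = min(grid, key=lambda lam: mspe_scalar(lam, n, c, sigma2, s))
target = lambda_zero(c, sigma2, s)
```

Here `best` coincides with `target` $= \sigma^2/s^2 = 4$ up to the grid spacing. In the matrix case the agreement holds only asymptotically, because $T_n$ no longer cancels from (A1.5).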

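Proof of Theorem 3.2. From (2.5) we have

$\hat{\sigma}^2(\lambda) = \hat{\sigma}^2 + (\lambda/n)^2\tilde{s}'T_n^2RC_n^{-1}R'\tilde{s}.$ (A1.7)

If one differentiates w.r.t. $\lambda$, one obtains:

$\hat{\sigma}^2(\lambda)' = 2(\lambda/n^2)\tilde{s}'P_n\tilde{s}$ (A1.8)

and

$k_n'(\lambda) = -n^{-1}\mathrm{tr}(T_n^2RC_n^{-1}R') = -n^{-1}\{\mathrm{tr}(P_n) + (\lambda/n)a(\lambda/n)\},$ (A1.9)

where $a(\lambda/n) = \mathrm{tr}(P_nRC_n^{-1}R')$. To find the minimizer, differentiate
(3.4)-(3.6) to obtain:

$\frac{\partial}{\partial\lambda}\mathrm{SC1}(\lambda) = \hat{\sigma}^2(\lambda)'/\hat{\sigma}^2(\lambda) + k_n'(\lambda)n^{-1}f(n, 0),$ (A1.10)

$\frac{\partial}{\partial\lambda}\mathrm{SC2}(\lambda) = \hat{\sigma}^2(\lambda)' + k_n'(\lambda)\tilde{\sigma}^2n^{-1}f(n, p),$ (A1.11)

$\frac{\partial}{\partial\lambda}\mathrm{SC3}(\lambda) = \hat{\sigma}^2(\lambda)'[1 + k_n(\lambda)n^{-1}f\{n, k_n(\lambda)\}]$

$+ \hat{\sigma}^2(\lambda)k_n'(\lambda)n^{-1}[f\{n, k_n(\lambda)\} + k_n(\lambda)f'\{n, k_n(\lambda)\}].$ (A1.12)

The minimizing equations are

$\hat{\sigma}^2(\lambda)' = -\hat{\sigma}^2(\lambda)k_n'(\lambda)n^{-1}f(n, 0),$

$\hat{\sigma}^2(\lambda)' = -\tilde{\sigma}^2k_n'(\lambda)n^{-1}f(n, p),$

$\hat{\sigma}^2(\lambda)' = -\hat{\sigma}^2(\lambda)k_n'(\lambda)n^{-1}f_1\{n, k_n(\lambda)\},$

where

$f_1\{n, k_n(\lambda)\} = [f\{n, k_n(\lambda)\} + k_n(\lambda)f'\{n, k_n(\lambda)\}] \times [1 + n^{-1}k_n(\lambda)f\{n, k_n(\lambda)\}]^{-1}.$

It is seen from (A1.7) that if $\lambda = O_p(n^{\delta})$, $\delta \ge 1$, $\hat{\sigma}^2(\lambda)$ is an inconsistent
estimator of $\sigma^2$ with an asymptotic upward bias. Thus it is sufficient to
consider the case $\delta < 1$. If one applies (A1.7)-(A1.9) one may write
(A1.10)-(A1.12) as

$2\lambda\tilde{s}'P_n\tilde{s} = \{\hat{\sigma}^2 + (\lambda/n)^2\tilde{s}'T_n^2RC_n^{-1}R'\tilde{s}\} \times \{\mathrm{tr}(P_n) + (\lambda/n)a(\lambda/n)\}f(n, 0),$ (A1.13)

$2\lambda\tilde{s}'P_n\tilde{s} = \tilde{\sigma}^2\{\mathrm{tr}(P_n) + (\lambda/n)a(\lambda/n)\}f(n, p),$ (A1.14)

and

$2\lambda\tilde{s}'P_n\tilde{s} = \{\hat{\sigma}^2 + (\lambda/n)^2\tilde{s}'T_n^2RC_n^{-1}R'\tilde{s}\} \times \{\mathrm{tr}(P_n) + (\lambda/n)a(\lambda/n)\}f_1\{n, k_n(\lambda)\},$ (A1.15)

respectively. Note that for $\delta < 1$, $f'\{n, k_n(\lambda)\} \to 0$ as $n \to \infty$ for all type
3 criteria considered here. Consequently,

$\lim_{n\to\infty} f_1\{n, k_n(\lambda)\} = \lim_{n\to\infty} f\{n, k_n(\lambda)\}.$

Letting $n \to \infty$, we have $\hat{\sigma}^2 \to \sigma^2$, $\tilde{\sigma}^2 \to \sigma^2$ and $\tilde{s} \to s$ in probability. If
one assumes $\delta < 1$, equations (A1.13)-(A1.15) become

$2\lambda s'RC^{-1}R's = \sigma^2\,\mathrm{tr}(RC^{-1}R')\{\lim_{n\to\infty} f(n, \cdot)\}.$ (A1.16)

From (A1.16) it is seen that SC with $\lim_{n\to\infty} f(n, \cdot) = 2$ are optimal.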



APPENDIX 2: GENERALIZING SAWA’S CRITERION 



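Generalizing Sawa's BIC using the generalized number of parameters we
have

$\mathrm{GBIC}(\lambda) = \ln\hat{\sigma}^2(\lambda) + 2n^{-1}\{k_n(\lambda) + 2\}\{\hat{\omega}^2/\hat{\sigma}^2(\lambda)\} - 2n^{-1}\{\hat{\omega}^2/\hat{\sigma}^2(\lambda)\}^2,$ (A2.1)

where $\hat{\omega}^2$ is an estimator of the variance of the “pseudo-true” model, cf.
Sawa (1978). If one differentiates (A2.1) w.r.t. $\lambda$, one obtains:

$\frac{\hat{\sigma}^2(\lambda)'}{\hat{\sigma}^2(\lambda)} + 2k_n'(\lambda)n^{-1}\frac{\hat{\omega}^2}{\hat{\sigma}^2(\lambda)} - 2\{k_n(\lambda) + 2\}n^{-1}\frac{\hat{\omega}^2\hat{\sigma}^2(\lambda)'}{\hat{\sigma}^4(\lambda)} + 4n^{-1}\frac{\hat{\omega}^4\hat{\sigma}^2(\lambda)'}{\hat{\sigma}^6(\lambda)} = 0.$ (A2.2)

Let $\lambda_0$ be the value of $\lambda$ minimizing (A2.1). Furthermore, interpret the
pseudo-true model as the model where the prior and sample information is
combined optimally; for the pseudo-true model, (A2.1) attains its minimum.
Then $\hat{\omega}^2 = \hat{\sigma}^2(\lambda_0)$, and (A2.2) becomes

$\frac{\hat{\sigma}^2(\lambda)'}{\hat{\sigma}^2(\lambda)}\{1 - 2n^{-1}k_n(\lambda)\} = -2k_n'(\lambda)n^{-1}$

or

$\hat{\sigma}^2(\lambda)' = -\hat{\sigma}^2(\lambda)k_n'(\lambda)n^{-1}f_2\{n, k_n(\lambda)\},$ (A2.3)

where

$f_2\{n, k_n(\lambda)\} = 2\{1 - 2n^{-1}k_n(\lambda)\}^{-1}.$

Equation (A2.3) may be used for solving $\lambda$ numerically. Note that
$\lim_{n\to\infty} f_2\{n, k_n(\lambda)\} = 2$ so that (A2.1) is also an optimal SC.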

APPENDIX 3: MODIFIED PROOF OF THEOREM 3.1 IN THE 
PRESENCE OF AUTOCORRELATED ERRORS IN MODEL (2.1) 

Suppose $\epsilon_n \sim N(0, \sigma^2\Sigma_n)$ in (2.1), where $\Sigma_n > 0$. Then

$E[\{b_R(\lambda) - \beta\}\{b_R(\lambda) - \beta\}'] = \sigma^2(U_n - U_nR'S_nRU_n)X_n'\Sigma_nX_n(U_n - U_nR'S_nRU_n)$

$+ U_nR'S_nss'S_nRU_n,$

so that

$q_n(\lambda) = E[\{b_R(\lambda) - \beta\}'X_n'X_n\{b_R(\lambda) - \beta\}]$

$= \sigma^2\,\mathrm{tr}\{D_nC_n^{-1}(I - n^{-1}R'S_nRC_n^{-1})^2\} + n^{-1}s'S_n^2RC_n^{-1}R's.$ (A3.1)

If one differentiates (A3.1) w.r.t. $\lambda$, one obtains

$q_n'(\lambda) = -2\sigma^2n^{-1}\mathrm{tr}\{D_nC_n^{-1}(I - n^{-1}R'S_nRC_n^{-1})R'(\lambda^{-1}S_n)^2RC_n^{-1}\}$

$+ 2n^{-1}s'(\lambda^{-1}S_n)^2RC_n^{-1}R'S_ns.$ (A3.2)

Then if one sets (A3.2) equal to zero one may show, after some manipulation,
that

$\lambda s'P_ns = \sigma^2\,\mathrm{tr}(T_n^3RC_n^{-1}D_nC_n^{-1}R').$ (A3.3)

If one lets $n \to \infty$ in (A3.3), one obtains the desired result.



A FAST GRAPHICAL GOODNESS OF FIT TEST FOR
TIME SERIES MODELS

ABSTRACT 

The oscillatory appearance of stationary time series is captured very 
economically by only a few higher order crossings which in addition contain 
a great deal of the spectral content of the process. A useful approximation 
to the variances of higher order crossings is discussed and is applied in the 
construction of probability limits for the hypothesized higher order crossings. 
From this, a graphical display of higher order crossings together with their 
probability limits provide a fast goodness of fit test. Examples illustrate the 
applicability of this device. 



1. INTRODUCTION 

There has been a growing interest in graphical methods in time series
analysis and especially so since the popularization of electronic devices with
graphics capabilities. In following this trend, the present article discusses a
certain zero-crossings based graphical technique useful in testing for
goodness of fit of time series models. The idea is to use plots of higher order
crossings which are akin to plots of the correlogram and spectral densities or
the periodogram, but which have the advantage of great simplicity. Under
the Gaussian assumption, the sequence of expected higher order crossings
is equivalent to the autocorrelation function and hence to the normalized
spectral distribution function, but it summarizes the data differently. In
this regard, the monotone property of higher order crossings plays an
instrumental role since the initial rate of increase exhibited by higher order
crossings proves to be an effective summary feature. As the higher order
crossings continue to increase, their rate loses its discrimination potency
since different processes seem to share similar rates. This is why in general
very few higher crossings are used in testing goodness of fit.

The present paper gives an overview of our previous work, particularly
Kedem and Reed (1986) and Kedem (1985), to which the reader is referred
for mathematical details and more examples.



2. PLOTS OF HIGHER ORDER CROSSINGS 

Let $\{Z_t\}$, $t = 0, \pm 1, \ldots$, be a zero mean stationary Gaussian process
with correlation function $\rho_j$ and normalized spectral distribution function
$F$, and let $\nabla$ be the difference operator, $\nabla Z_t = Z_t - Z_{t-1}$. It is convenient to
introduce the clipped binary process

$X_t^{(k)} = 1$ if $\nabla^{k-1}Z_t \ge 0$, and $X_t^{(k)} = 0$ otherwise, $k = 1, 2, \ldots,$

which gives rise to the indicator at time $t$

$d_t^{(k)} = 1$ if $X_t^{(k)} \ne X_{t-1}^{(k)}$, and $d_t^{(k)} = 0$ otherwise.

The higher order crossings of order $k$ are defined by

$D_{k,n} = d_2^{(k)} + \cdots + d_n^{(k)}.$

It is seen that $D_{k,n}$ counts the number of axis-crossings in the $(k - 1)$th
differenced series $\nabla^{k-1}Z_1, \ldots, \nabla^{k-1}Z_n$. $D_{1,n}$ then is the usual number of
zero- or axis-crossings by the original series $Z_1, \ldots, Z_n$.
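The counting rule above translates directly into code. The following Python sketch clips a series at zero, counts sign changes, and repeats the count after each differencing. One simplification is an assumption of this sketch and not part of the definition: with a finite record, each differencing shortens the series by one observation, so the order $k$ count here uses $n - k$ indicators instead of $n - 1$. The function names are hypothetical.

```python
def difference(series):
    """Apply the difference operator once: z_t - z_{t-1}."""
    return [series[t] - series[t - 1] for t in range(1, len(series))]


def axis_crossings(series):
    """Count changes in the clipped binary series X_t = 1 if z_t >= 0 else 0."""
    clipped = [1 if z >= 0 else 0 for z in series]
    return sum(1 for t in range(1, len(clipped)) if clipped[t] != clipped[t - 1])


def higher_order_crossings(series, max_order):
    """Return [D_1, ..., D_max_order], differencing the series between counts."""
    counts = []
    current = list(series)
    for _ in range(max_order):
        counts.append(axis_crossings(current))
        current = difference(current)
    return counts


# Two hypothetical series: a rapidly alternating one and a slowly varying one.
fast = [1.0, -1.0, 1.0, -1.0, 1.0]
slow = [1.0, 2.0, 3.0, 2.0, 1.0, -1.0, -2.0, -1.0, 1.0, 2.0]
hoc_fast = higher_order_crossings(fast, 3)
hoc_slow = higher_order_crossings(slow, 3)
```

For the slowly varying series the counts grow with the order, which is the increase that the plots of higher order crossings display.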

From the point of view of the theory of stationary Gaussian processes, 
the sequence of higher order crossings is equivalent to the correlation and 
spectral structures. This is stated precisely in the following theorem. 

Theorem 1. Let $\{Z_t\}$ be a zero mean stationary Gaussian process with
correlation function $\rho_j$. Then the sequence $\{\rho_j\}$ is completely determined from
the sequence $\{E(D_{j,n})\}$. That is, $\rho_k$ is determined by $E(D_{1,n}), \ldots, E(D_{k,n})$.
Proof. From Kedem and Slud (1981),

_ -( “0 +Ci [(”) + (»-«)] - ■■■ + (-l)V+i 



and the $\rho_j$ can be determined recursively from the $E(D_{k,n})$.

Obviously it is also true, from (1), that knowledge of $\{\rho_j\}$ is equivalent
to knowledge of the sequence $\{E(D_{j,n})\}$. It follows that $F$ is completely
determined by the sequence of expected higher order crossings. This is
summarized by the symbolism

$\{E(D_{j,n})\} \Leftrightarrow \{\rho_k\} \Leftrightarrow F.$

Thus, exactly for the same reasons that plots of $\rho_k$ and $F$ are extensively
used in time series analysis, it is useful to observe plots of higher order
crossings too.

The main thing to observe in plots of higher order crossings is the rate
at which they increase and the starting point $D_{1,n}$. The fact that higher
order crossings tend to increase can be attributed to the general fact that

$D_{j,n} \le D_{j+1,n} + 1$

with probability one. Hence the $D_{j,n}$ tend to increase with $j$ for fixed but
large $n$. See also Kedem and Slud (1981).

It is instructive to observe plots of higher order crossings and thus
motivate the central idea of this paper. Figure 1 displays plots of ten higher
order crossings $D_{1,1000}, \ldots, D_{10,1000}$, obtained from first order autoregressive
processes with different parameter values $\phi$. It is seen that the initial
rate of increase and starting point differ from process to process, but that
as the order increases the rate is almost independent of the parameter. This
same behavior has been observed in numerous cases, which may be
interpreted to mean that only the very first few higher crossings carry sufficient
information to discriminate clearly between different processes.

Accordingly, it is suggested that plots with as few as six values of $D_{j,n}$
can be useful in goodness of fit testing. At the same time it should be noted
that crossings of high order carry information also, but this information is
less amenable and will not be used here.



3. THE VARIANCE OF HIGHER ORDER CROSSINGS 

The probability distribution of the $D_{j,n}$ is quite intractable and we shall
concentrate on the more modest problem of approximating the variance of
higher order crossings needed for the proposed goodness of fit test.

In general, the variance of $D_{j,n}$ is a function of the fourth order cumulant
function $\kappa^{(j)}(r, s, t)$ of $\{X_t^{(j)}\}$, which is summable under appropriate moment
conditions. Thus for $j = 1$ the following asymptotic result was proved by
Kedem (1980).

Figure 1. Plots of $D_{j,1000}$, $j = 1, \ldots, 10$, from $Z_t = \phi Z_{t-1} + u_t$, where the $u_t$ are
independent $N(0, 1)$ random variables and $\phi$ = 0.75, 0.5, 0.25, 0, -0.25,
-0.5, -0.75.

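Theorem 2. If $\rho_j$ is absolutely summable then

$\sum_{k=-\infty}^{\infty} |\kappa^{(1)}(1, -k, 1 - k)| < \infty$

and

$\lim_{n\to\infty} \mathrm{Var}(D_{1,n}/\sqrt{n}) = \sigma_1^2,$

where

$\sigma_1^2 = \frac{1}{\pi^2}\sum_{k=-\infty}^{\infty}[(\sin^{-1}\rho_k)^2 + \sin^{-1}\rho_{k-1}\sin^{-1}\rho_{k+1} + 4\pi^2\kappa^{(1)}(1, -k, 1 - k)].$

The same result applies to every $D_{j,n}$ provided the correlation function of
$\{\nabla^{j-1}Z_t\}$ decays quickly to zero. However, $\kappa^{(j)}$ is not known in general, which
makes the above result impractical.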

Another approach is to hold $n$ fixed and let $j$ increase. In this case
it is possible to obtain a useful asymptotic result under the assumption of
m-dependence. Assume that $\pi$ is a point of increase for $F$ and let

$\lambda_1^{(k)} = \Pr(X_t^{(k)} = 1 \mid X_{t-1}^{(k)} = 1).$

Then $\lambda_1^{(k)} \to 0$ as $k \to \infty$ and it was shown by Kedem and Reed (1986) that

$\mathrm{cov}(d_t^{(k)}, d_s^{(k)}) = o(\lambda_1^{(k)}).$ (2)

The proof of this fact depends on the differential properties of the
correlation function of $\{Z_t\}$. (2) readily yields the following theorem.

Theorem 3. Let $\{Z_t\}$ be an m-dependent stationary Gaussian process and
assume that $\pi$ is a point in the support of $F$. Then for fixed $n$

$\lim_{k\to\infty} \frac{\mathrm{Var}(D_{k,n})}{(n - 1)\lambda_1^{(k)}(1 - \lambda_1^{(k)})} = 1.$

This result was used in the construction of probability limits for the higher
crossings under the hypothesis of white noise. However, the assumption of
m-dependence cannot always be verified and thus another approximation
should be used.

A rather close approximation to the variance of Dj^n can be provided 
if it is assumed that the binary sequence is a Markov chain. This 

first order approximation has been found very satisfactory by an extensive 
simulation. 

Define the parameters associated with the chain, 



and 



p(fc) = 1 _ A^, qW = Ai"), 



1/W = 



1 - 2A^*^ + A^*’^ 
2 (l - 



When the process is a stationary Gaussian autoregressive-moving average
process with known (or hypothesized) parameters, $\lambda_1^{(k)}$ and $\lambda_2^{(k)}$ are known
too explicitly. Then if $\{d_t^{(k)}\}$ is a Markov chain it can be shown (Kedem,
1985) that

    $\mathrm{Var}(D_{k,n}) = (n-1)p^{(k)}q^{(k)} + \frac{2p^{(k)}q^{(k)}(\nu^{(k)} - p^{(k)})}{1 - \nu^{(k)}}\,[(n-1) - N_{k,n}]$,    (3)

where

    $N_{k,n} = \frac{1 - \{(\nu^{(k)} - p^{(k)})/q^{(k)}\}^{n-1}}{1 - (\nu^{(k)} - p^{(k)})/q^{(k)}}$.

This approximation has been compared (Kedem, 1985) with actual estimates
obtained from 100 independent realizations each of length $n = 1000$. The
results are given in Table 1. Although $E(D_{j,1000})$ are known explicitly when
the parameters are known, these expectations are estimated too as a check
of the whole simulation. It is seen that (3) agrees well with the simulation
results. An algorithm for obtaining $\lambda_1^{(k)}$ and $\lambda_2^{(k)}$ is given by Kedem (1985).
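
As a numerical check on (3), the following Python sketch evaluates the Markov approximation for Gaussian white noise. It rests on assumptions that are not stated in the text: the arcsine relation $\lambda_j^{(k)} = 1/2 + \arcsin(\rho_j^{(k)})/\pi$ for a zero-mean stationary Gaussian series, the autocorrelations $\rho_1 = -(k-1)/k$ and $\rho_2 = (k-1)(k-2)/\{k(k+1)\}$ of white noise differenced $k-1$ times, and a geometric-sum form for the correction term in (3). The function names are illustrative only.

```python
import math


def white_noise_lambdas(k):
    """lambda_1 and lambda_2 for Gaussian white noise differenced k-1 times.

    Assumption: lambda_j = 1/2 + arcsin(rho_j) / pi (Gaussian arcsine law).
    """
    m = k - 1
    rho1 = -m / (m + 1)
    rho2 = m * (m - 1) / ((m + 1) * (m + 2))
    lam1 = 0.5 + math.asin(rho1) / math.pi
    lam2 = 0.5 + math.asin(rho2) / math.pi
    return lam1, lam2


def markov_variance(lam1, lam2, n):
    """Markov-chain approximation (3) to Var(D_{k,n})."""
    p = 1.0 - lam1
    q = lam1
    nu = (1.0 - 2.0 * lam1 + lam2) / (2.0 * (1.0 - lam1))
    r = (nu - p) / q                      # assumed geometric decay rate
    geo = (1.0 - r ** (n - 1)) / (1.0 - r)
    return (n - 1) * p * q + 2.0 * p * q * (nu - p) / (1.0 - nu) * ((n - 1) - geo)


def markov_sd_white_noise(k, n):
    """Standard deviation of D_{k,n} under white noise, from (3)."""
    lam1, lam2 = white_noise_lambdas(k)
    return math.sqrt(markov_variance(lam1, lam2, n))


if __name__ == "__main__":
    for k in range(1, 4):
        print(k, round(markov_sd_white_noise(k, 1000), 2))
```

For $n = 1000$ and $k = 1, 2, 3$ the sketch gives standard deviations near 15.8, 13.1 and 12.2, close to the white noise entries computed from (3) in Table 1.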

4. A GRAPHICAL GOODNESS OF FIT CRITERION 

The proposed goodness of fit test is based on deviations of the observed
path of higher crossings from the expected path where the latter is obtained
under the hypothesis of an assumed model. Marked deviations of the
observed path from the expected one suggest that the observed process does
not oscillate as expected. The closeness of the two paths can be measured
by appealing to (3) and to conditions under which the $D_{j,n}$ are
asymptotically normal. It can be shown, using the technique of Cuzick (1976), that
when $\{z_t\}$ is Gaussian the condition $\sum |\rho_k| < \infty$ implies the asymptotic
normality of the $D_{k,n}$. It follows that approximate 95% probability limits
for $D_{k,n}$ are, for each $k$ and sufficiently large $n$,

    $(n-1)(1-\lambda_1^{(k)}) \pm 1.96\,\{\mathrm{Var}(D_{k,n})\}^{1/2}$,    (4)

where $\mathrm{Var}(D_{k,n})$ is given by (3). When at least one observed $D_{j,n}$,
$j = 1, \ldots, 6$, lies outside the limits (4) the assumed model under which (4) was
derived is rejected. Before discussing the power of this test it is illustrated
by a few examples.
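
The observed path used in this test can be computed directly from data. The Python sketch below counts sign changes in a centred series and in its successive differences. It is a minimal illustration under two assumptions: the series is centred at its sample mean, and each differencing step simply shortens the series by one observation. The function names are hypothetical.

```python
def higher_order_crossings(series, max_order=6):
    """Return [D_1, ..., D_max_order] for a sequence of numbers.

    D_1 counts sign changes of the mean-centred series; D_j counts sign
    changes after j-1 rounds of first differencing.
    """
    mean = sum(series) / len(series)
    current = [value - mean for value in series]
    counts = []
    for _ in range(max_order):
        signs = [value >= 0 for value in current]
        counts.append(sum(a != b for a, b in zip(signs, signs[1:])))
        # difference once more for the next order
        current = [b - a for a, b in zip(current, current[1:])]
    return counts


def outside_limits(observed, centre, variance, z=1.96):
    """True when an observed count falls outside centre +/- z * sd."""
    return abs(observed - centre) > z * variance ** 0.5


if __name__ == "__main__":
    print(higher_order_crossings([1, -1, 1, -1, 1, -1], max_order=3))
```

A model would be rejected in the sense of (4) when `outside_limits` is true for at least one of the six counts, with the centre and variance taken from the hypothesized model.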

4.1 Examples 

Annual Mean Temperature. The graph of the annual mean air temperature 
from 1781 to 1980 at Hohenpeissenberg, Germany, is given in Figure 2. 
Actually the observations for 1811 and 1812 are missing and were replaced 
by the mean of neighboring observations. This has only a very small effect 
on the sequence of higher crossings. 

Since annual temperature is hard to predict, we could ask, does the series 
oscillate as white noise? The answer is obtained from Figure 3 where it is 
seen that the higher order crossings are well within the bounds (4) so that 
at least in this sense the series resembles white noise. For comparison, the 
figure portrays the higher crossings of simulated white noise which also fall
within the bounds as expected.

ARMA Models. Figure 4 shows the probability limits (4) under various
hypotheses; these are white noise, second order autoregressive process with
parameters 0.4 and -0.7, and second order autoregressive moving average
process with parameters $\phi = (-1.4, -0.5)$, $\theta = (0.2, 0.1)$. The actual
$D_{j,n}$ were obtained from simulated data given in an appendix by Priestley (1981).

Table 1. Comparison of (3) with the standard deviation obtained
from 100 independent realizations of size 1000.
$E(D_{j,1000})$ and $\hat{E}(D_{j,1000})$ are rounded to the nearest integer.

Series                j   $E(D_{j,1000})$   $\hat{E}(D_{j,1000})$   $\{Var(D_{j,1000})\}^{1/2}$   $\{\widehat{Var}(D_{j,1000})\}^{1/2}$
                                            From 100                From (3)                      From 100
                                            Realizations                                          Realizations

White Noise           1   500               497                     15.81                         15.96
                      2   666               666                     13.15                         13.63
                      3   732               732                     12.16                         12.53
                      4   769               770                     11.57                         11.49
                      5   794               795                     11.18                         11.05
                      6   813               814                     10.82                         10.00

AR(2)                 1   424               425                      9.64                          9.67
$\phi_1 = 0.4$        2   484               485                      9.38                          9.13
$\phi_2 = -0.7$       3   536               537                     10.29                         10.81
                      4   594               594                     11.27                         12.72
                      5   651               652                     11.87                         12.02
                      6   702               701                     12.04                         11.34

ARMA(1,1)             1   552               552                     14.62                         14.74
                      2   679               679                     12.96                         12.87
                      3   737               737                     12.09                         12.05
                      4   773               772                     11.27                         11.52
                      5   797               797                     10.70                         11.12
                      6   814               814                     10.15                         10.80

ARMA(2,2)             1   884               883                     10.04                         10.51
$\phi_1 = -1.4$       2   897               897                      9.20                          9.53
$\phi_2 = -0.5$       3   903               903                      8.84                          9.01
$\theta_1 = 0.2$      4   908               908                      8.60                          8.50
$\theta_2 = 0.1$      5   911               911                      8.43                          8.47
                      6   914               914                      8.29                          9.38





Figure 3. Probability limits for the higher order crossings from the
temperature series. The series oscillates as white noise (WN).

Figure 4. Sample higher order crossings paths fall within their respective
limits.

It is seen that the three processes display different oscillation patterns,
which are captured very economically by only six higher order crossings.
The ARMA(2,2) process is most oscillatory while the AR(2) is much
smoother.

Signal Detection. Figure 5 displays two series which appear to be very similar 
except perhaps for scale. However their higher order crossings quickly reveal 
that the first one oscillates as white noise while the other oscillates roughly 
as a low order autoregressive process. This is illustrated in Figure 6. 

Diagnostic Check. In testing the goodness of fit of a model one runs a 
residual analysis which usually tests whether the residual series constitutes 
white noise. Consider series A, D of Box and Jenkins (1976, p. 293); the 
fitted models are 



    series A: $\nabla z_t = u_t - 0.7u_{t-1}$

and

    series D: $z_t - 0.87z_{t-1} = 1.17 + u_t$,

where $\{u_t\}$ is the residual series. Figure 7 however reveals that the two
residual series are not quite white noise as signified by the axis-crossings
being outside the limits (4). It is interesting to note though that the rest of
the higher order crossings behave as those of white noise. Thus, except for
smaller $D_{1,n}$, the two residual series oscillate as white noise.

Figure 6. The higher order crossings paths of series (a), (b). The first path
is within white noise bounds.

Figure 7. Diagnostic check applied to the residuals of series A (n = 177)
and series D (n = 290). $D_{1,n}$ outside the limits (4) for white noise.



4.2 Power Simulation 

The limits (4) provide approximate 95% bounds for each value of $D_{j,n}$.
However our test is based on $D_{1,n}, \ldots, D_{6,n}$ simultaneously and the
hypothesized model is rejected if at least one value of $D_{j,n}$ falls outside the
probability bounds. It is expected that a test which is based on more than a single
$D_{j,n}$ has a higher probability of rejecting a true hypothesis than 0.05 and in
fact our experience indicates that with six values of $D_{j,n}$ this probability is
about 0.1. The exact probability is still an open problem.
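
The simultaneous rejection probability discussed above can be explored by Monte Carlo. The Python sketch below is only an illustration under assumptions not taken from the text: Gaussian white noise is simulated, the limits come from the Markov approximation with the arcsine relation $\lambda_j = 1/2 + \arcsin(\rho_j)/\pi$, the number of adjacent pairs is reduced by one at each differencing step, and all names are hypothetical. The estimate depends on the seed and on the number of repetitions.

```python
import math
import random


def lambdas_white_noise(k):
    """lambda_1, lambda_2 for white noise differenced k-1 times (arcsine law)."""
    m = k - 1
    rho1 = -m / (m + 1)
    rho2 = m * (m - 1) / ((m + 1) * (m + 2))
    return (0.5 + math.asin(rho1) / math.pi,
            0.5 + math.asin(rho2) / math.pi)


def limits(k, pairs, z=1.96):
    """Approximate limits for the crossings count among `pairs` adjacent pairs."""
    lam1, lam2 = lambdas_white_noise(k)
    p, q = 1.0 - lam1, lam1
    nu = (1.0 - 2.0 * lam1 + lam2) / (2.0 * (1.0 - lam1))
    r = (nu - p) / q
    geo = (1.0 - r ** pairs) / (1.0 - r)
    var = pairs * p * q + 2.0 * p * q * (nu - p) / (1.0 - nu) * (pairs - geo)
    half = z * math.sqrt(var)
    return pairs * p - half, pairs * p + half


def rejects_white_noise(series, orders=6):
    """True if any of D_1..D_orders falls outside its white noise limits."""
    current = list(series)
    for k in range(1, orders + 1):
        signs = [value >= 0 for value in current]
        crossings = sum(a != b for a, b in zip(signs, signs[1:]))
        low, high = limits(k, len(current) - 1)
        if crossings < low or crossings > high:
            return True
        current = [b - a for a, b in zip(current, current[1:])]
    return False


def estimated_level(n=450, reps=200, seed=0):
    """Fraction of simulated white noise series rejected by the joint test."""
    rng = random.Random(seed)
    hits = sum(
        rejects_white_noise([rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(reps)
    )
    return hits / reps


if __name__ == "__main__":
    print(estimated_level())
```

With six correlated checks, each at a nominal 5% level, the estimate should fall between 0.05 and 0.30 apart from sampling error.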

An indication of the power is provided in Table 2 which gives the power
for testing the hypothesis of white noise where the alternative is the indicated
process. The power is estimated from 50 independent series each of size 450.
Similar results were obtained for greater series lengths.

Table 2. Power simulation for testing white noise
versus the indicated process.

Process                                                                      Power

White Noise                                                                   .10
AR(1), $\phi = .05$                                                           .26
MA(1), $\theta = .1$                                                          .40
AR(1), $\phi = .2$                                                            .90
AR(1), $\phi = .5$                                                           1.00
AR(2), $\phi_1 = .1$, $\phi_2 = -.15$                                         .88
ARMA(1,1), $\phi_1 = .1$, $\theta_1 = -.1$                                    .86
ARMA(2,2), $\phi_1 = .1$, $\phi_2 = -.4$, $\theta_1 = 0$, $\theta_2 = .3$    1.00
ARMA(2,2), $\phi_1 = .1$, $\phi_2 = -.2$, $\theta_1 = .2$, $\theta_2 = .1$    .88

OUTLIERS IN TIME SERIES
1. INTRODUCTION 

Every experimenter has at some time or other faced data which seem to 
contain some deviant or “outlying” observations. The problem of outliers in 
data is an old one and was one of the first to receive a statistical treatment. 

One of the early references to the rejection of outliers seems to have
been a remark by Bessel (see Anscombe, 1960) in a geodetic study; he
remarked that he had never rejected an observation merely because of its large
residual. An early attempt at developing a rejection criterion based on
probability reasoning was that of Peirce (1852). He developed an outlier rejection
criterion and applied it to 15 observations of the vertical semi-diameters of
Venus made by Lt. Herndon, with the meridian circle at Washington in
1846. This started a lively debate which continues until today.

Major developments in the area of outlier rejection have been mainly in
the estimation of a location parameter from a random sample or formulation
of the parameters of a linear regression model when there is the possibility
of outliers. Approaches to this problem include:

(1) Significance tests (Dixon, 1950; Grubbs, 1950; etc.), 

(2) Premium protection (Anscombe, 1960), 

(3) Decision theoretic methods (Ferguson, 1961), 

(4) Bayesian methods (de Finetti, 1961; Box and Tiao, 1968; Guttman, 
1973; Abraham and Box, 1978), 

(5) Robustness (Huber, 1964, 1981; Hampel, 1974), and 

(6) Diagnostics (for example, Cook and Weisberg, 1982).

Papers in this area seem to classify the objectives of an analysis as
follows: (i) making inferences about some parameters, (ii) making a general
study with a view to gaining broader understanding of the problem without
necessarily making inferences about the parameters, and (iii) discovering if
outliers are present and then singling them out for further study. The
approaches listed above give varying degrees of importance to these objectives.
For example, the ‘robust approach’ gives importance to the first objective,
while the Bayesian approach addresses all three objectives.

Discussion of outliers in the context of time series is rather recent. In 
this paper we summarize some of the important developments and introduce 
some recent work in this area. 



2. CHARACTERIZATION OF OUTLIERS IN TIME SERIES 

Suppose $z_t$ ($t = 0, \pm 1, \ldots$) is a time series observed at equally spaced
intervals of time. Consider the familiar time series model

    $z_t = \psi(B)a_t$,

where $\psi(B) = \sum_{j \ge 0}\psi_j B^j$ with $\psi_0 = 1$, $B$ is a backward shift
operator such that $Ba_t = a_{t-1}$ and $\{a_t, t = 0, \pm 1, \ldots\}$
is a sequence of independent identically distributed normal random
variables with mean zero and variance $\sigma^2$. Often $\psi(B)$ is expressed as a ratio
$\theta(B)/\phi(B)$ of finite moving average and autoregressive polynomial operators.

One can characterize an aberrant observation at $t = q$ by the model
$y_t = z_t + \omega_1 I_t(q)$, where $y_t$ is the actual observation at time $t$, $I_t(q) = 1$
if $t = q$ (i.e., if the $q$th observation is aberrant) and $I_t(q) = 0$ otherwise,
and $\omega_1$ is the amount of shift. This model is referred to as the aberrant
observation (AO) model (see Abraham and Box, 1979) or as the Type I
model (Fox, 1972). Alternatively it might be that the aberration affects the
innovation at $t = q$. This can be represented as $z_t = \psi(B)(a_t + \omega_2 I_t(q))$
where $\psi(B)$ and $a_t$ are defined as before, $I_t(q) = 1$ if $t = q$ (if the $q$th
innovation is aberrant) and $I_t(q) = 0$ otherwise. This is referred to as the
aberrant innovation (AI) or Type II model.

Obviously one can embed the AO and AI models in the transfer function
via the intervention model approach:

    $y_t = \{\omega(B)/\delta(B)\}I_t(q) + \psi(B)a_t$,

where $\omega(B)$ and $\delta(B)$ are finite polynomial operators. If one lets $\omega(B) = \omega_1$
and $\delta(B) = 1$ one obtains the AO model; alternatively, letting $\omega(B) = \omega_2$
and $\delta(B) = \phi(B)$ yields the AI model.

The implications of the AO and AI models can be seen clearly by
considering a special case $\psi(B) = 1/[1 - \phi B]$, $|\phi| < 1$. This is the case of the
well known first order autoregressive process {AR(1)}.

(i) AO: $z_t = \phi z_{t-1} + a_t$, $y_t = z_t + \omega_1 I_t(q)$. Figure 1(a) shows the plot of
    $y_t$ vs. $y_{t-1}$ where $y_t$ represents a typical set of observations from the
    AR(1) model with an aberration at $t = q$. The points $(y_{q-1}, y_q)$ and
    $(y_q, y_{q+1})$ are not consistent with the least squares line because $y_q$ is
    aberrant. Although only one observation is aberrant, two points are
    affected. Thus the least squares estimate of $\phi$ would not have desirable
    properties. In fact it can be shown that it is asymptotically biased (see,
    for example, Martin, 1980).

(ii) AI: $z_t = \phi z_{t-1} + a_t + \omega_2 I_t(q)$, $y_t = z_t$. Figure 1(b) shows a plot similar
    to that given in Figure 1(a). Here the aberration is in the innovation
    at $q$. The point $(y_{q-1}, y_q)$ is not consistent with the least squares line.
    However, $(y_q, y_{q+1})$ is consistent with the line although it is “away” from
    the rest. Later points $(y_{q+i}, y_{q+i+1})$ are also consistent with the least
    squares line. The least squares estimate of $\phi$, in this case, is consistent.
    Thus the AI model does not lead to serious problems while the AO model
    does.
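
The contrast between the two outlier types in the AR(1) case can be seen in a small Python sketch. It is an illustration only: the function names, the zero starting value and the no-intercept least squares estimate are assumptions made here, not part of the text.

```python
def ar1_path(phi, innovations):
    """Generate z_t = phi * z_{t-1} + a_t, starting from z = 0."""
    path, previous = [], 0.0
    for shock in innovations:
        previous = phi * previous + shock
        path.append(previous)
    return path


def with_ao(phi, innovations, q, omega):
    """Observed series with an aberrant observation of size omega at index q."""
    observed = ar1_path(phi, innovations)
    observed[q] += omega          # only y_q is shifted
    return observed


def with_ai(phi, innovations, q, omega):
    """Observed series with an aberrant innovation of size omega at index q."""
    shocked = list(innovations)
    shocked[q] += omega           # the shock then propagates through the AR(1)
    return ar1_path(phi, shocked)


def ls_phi(series):
    """Least squares estimate of phi from the pairs (y_{t-1}, y_t)."""
    cross = sum(a * b for a, b in zip(series, series[1:]))
    return cross / sum(a * a for a in series[:-1])


if __name__ == "__main__":
    quiet = [0.0] * 6
    print(with_ao(0.5, quiet, 2, 4.0), ls_phi(with_ao(0.5, quiet, 2, 4.0)))
    print(with_ai(0.5, quiet, 2, 4.0), ls_phi(with_ai(0.5, quiet, 2, 4.0)))
```

In this noise-free toy case the AI series still follows the line with slope $\phi$, so the least squares estimate recovers 0.5, while the isolated AO point pulls the estimate to zero.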



Figure 1. (a) plot of $y_t$ vs. $y_{t-1}$, AR(1) with an AO outlier; (b) plot of $y_t$
vs. $y_{t-1}$, AR(1) with an AI outlier.



It should be noted from the definition that an AO outlier at $t = q$ affects
only $y_q$ and consequently the residuals $e_q, e_{q+1}, \ldots$, while an AI outlier at
$t = q$ affects $e_q$ and hence $y_q, y_{q+1}, \ldots$.

3. DIFFERENT APPROACHES TO TIME SERIES
ANALYSIS IN THE PRESENCE OF OUTLIERS

We now consider the major approaches to analysis of time series when 
outliers may be present. 

3.1 Robustness 

Martin (1980) discussed robust estimation of the parameters of time series
models in the presence of outliers. He defined the AO and AI models in
a form slightly different from that given above. The observations
$\{y_t, t = 0, \pm 1, \ldots\}$ are defined to be $y_t = z_t + \epsilon_t$, where $z_t$ is the actual generating
process and $\epsilon_t$ is a ‘measurement error’ sequence.

AO: In this case the $a_t$'s are defined as before but $\epsilon_t$ is a random variable
such that

    $P(\epsilon_t = 0) = 1 - \alpha$, $P(\epsilon_t \ne 0) = \alpha$, $0 < \alpha < 1$.

AI: Here $\epsilon_t = 0$ and $a_t$ has a heavy tailed nonnormal distribution.

With these characterizations robust estimates are obtained for the
autoregressive process of order $p$ {AR($p$)}, that is:

    $z_t = \sum_{i=1}^{p}\phi_i z_{t-i} + a_t$.

This process can also be written as

    $z_t = X_t'\phi + a_t$,

where $X_t' = (z_{t-1}, z_{t-2}, \ldots, z_{t-p})$ and $\phi' = (\phi_1, \phi_2, \ldots, \phi_p)$. M-estimates
of $\phi$ may be obtained by minimizing $\sum_{t=p+1}^{n}\rho((z_t - X_t'\phi)/\hat\sigma)$, where $\rho$ is a
symmetric robustifying loss function, $n$ is the number of observations and $\hat\sigma$
is a robust scale estimate of $\sigma$. This minimization leads to the estimating
equation

    $\sum_{t=p+1}^{n} X_t\,\psi((z_t - X_t'\hat\phi)/\hat\sigma) = 0$,

where $\psi$, which is the derivative of $\rho$, is bounded and continuous, and where
$\hat\phi$ denotes the M-estimate of $\phi$.

In the AI case, since $X_t\psi(\cdot)$ is bounded in $X_t$, under suitable regularity
conditions it can be shown that $\hat\phi$ is consistent and asymptotically normal.
However, in the AO case $X_t\psi(\cdot)$ is not bounded and the M-estimates are
“non-robust”. Thus one may obtain the so-called generalized M-estimates,
or bounded influence estimates, by solving the estimating equation

    $\sum_{t=p+1}^{n}\omega(X_t)X_t\,\psi((z_t - X_t'\hat\phi)/\hat\sigma) = 0$,

where $\omega(X_t)$ is a weight function such that $\omega(X_t)X_t$ is bounded and
continuous. The estimates obtained are also biased. However, this bias is smaller
than that of either the M- or the least squares estimates.

When the process is autoregressive moving average (ARMA), the
situation is more complicated. For the AI case, M-estimates can be obtained
by the same procedure as described for the AR process. However, in the 
AO situation the generalized M-estimates, or bounded influence estimates, 
do not seem appealing. Thus Martin (1980) discussed approximate maxi- 
mum likelihood type estimates which are obtained iteratively. This approach 
seems to be extremely cumbersome. Martin (1983) also discussed a strategy 
for: 

(1) detecting outliers, 

(2) cleaning the data with a robustly-estimated autoregressive approxima- 
tion, and 

(3) building a model for the cleaned-up data using the usual procedures.

3.2 Iterative Maximum Likelihood

Generalizing the procedure of Fox (1972), Chang and Tiao (1984) dis- 
cussed a method of identifying and adjusting spurious observations. Their 
method is as follows. 

Suppose that the parameters $\phi$ and $\theta$ of an ARMA time series model are
known. Then the outlier models can be written as

    $e_t = \omega_1\pi(B)I_t(q) + a_t = \omega_1 x_{1t} + a_t$    (AO)

and

    $e_t = \omega_2 I_t(q) + a_t = \omega_2 x_{2t} + a_t$    (AI)

where

    $\pi(B) = \phi(B)/\theta(B) = 1 - \pi_1 B - \pi_2 B^2 - \cdots$,
    $e_t = \pi(B)z_t$,
    $x_{1t} = \pi(B)I_t(q)$,

and

    $x_{2t} = I_t(q)$.

Approximate maximum likelihood estimates of $\omega_1$ and $\omega_2$ assuming $q$
known are:

    $\hat\omega_1 = \sum_t e_t x_{1t}\Big/\sum_t x_{1t}^2 = \eta^{-2}\Big(e_q - \sum_{i \ge 1}\pi_i e_{q+i}\Big)$

and

    $\hat\omega_2 = \sum_t e_t x_{2t}\Big/\sum_t x_{2t}^2 = e_q$,

and their variances are $V(\hat\omega_1) = \sigma^2/\eta^2$ and $V(\hat\omega_2) = \sigma^2$, where $\eta^2 = \sum_t x_{1t}^2$.

The likelihood ratio statistics, $\lambda_{1q} = \hat\omega_1\eta/\sigma$ and $\lambda_{2q} = \hat\omega_2/\sigma$, are
considered, and the statistic

    $\lambda = \max_t \max_{1 \le i \le 2} |\lambda_{it}|$

is used to check for outliers. When an observation is found to be an outlier,
it is modified using $\hat\omega_1$ or $\hat\omega_2$ as the case may be. This procedure is repeated
with the new observations until no further outliers are found. Details of
the procedure are given by Chang and Tiao (1984), and by Abraham and
Ledolter (1983). The distribution of $\lambda$ is not yet well known. Also it is not
clear how efficient the procedure is when more than one outlier is present.
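
For a concrete picture of the two estimates, the Python sketch below specializes to an AR(1) model with known $\phi$ and $\sigma$, where $\pi(B) = 1 - \phi B$, so that $x_{1q} = 1$ and $x_{1,q+1} = -\phi$. This specialization, the function name and the requirement that position $q + 1$ exists are assumptions made for illustration; the general procedure is the one cited above.

```python
import math


def outlier_estimates(residuals, q, phi, sigma):
    """AO and AI shift estimates and their standardized statistics at index q.

    Assumes an AR(1) model with known phi and sigma, and q + 1 < len(residuals).
    """
    e_q, e_next = residuals[q], residuals[q + 1]
    sum_x1_sq = 1.0 + phi * phi                    # sum of x_{1t}^2 for AR(1)
    omega_ao = (e_q - phi * e_next) / sum_x1_sq    # regression of e_t on x_{1t}
    omega_ai = e_q                                 # regression of e_t on x_{2t}
    stat_ao = omega_ao * math.sqrt(sum_x1_sq) / sigma
    stat_ai = omega_ai / sigma
    return omega_ao, omega_ai, stat_ao, stat_ai


if __name__ == "__main__":
    # Noise-free residual pattern produced by an AO of size 3 at index 2.
    print(outlier_estimates([0.0, 0.0, 3.0, -1.5, 0.0], 2, 0.5, 1.0))
```

In the noise-free example the AO estimate recovers the shift of 3 and its standardized statistic exceeds the AI one, which is the comparison the maximum over $i$ relies on.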

3.3 Bayesian Approach 

Abraham and Box (1979) discussed a Bayesian approach in which it is
assumed that any innovation $a_t$ has a small prior probability, $\alpha$, $0 < \alpha < 1$,
of being aberrant; that is, it is generated from a normal distribution with
mean $\omega$ and variance $\sigma^2$, and a complementary probability,
$1 - \alpha$, that it is not aberrant or in other words is $N(0, \sigma^2)$. Under this
regime it was shown that the posterior distribution of $\phi = (\phi_1, \ldots, \phi_p)$ in
an autoregressive process of order $p$ is given by

    $P(\phi \mid y) = \sum_{(r)} w_{(r)}\,P(\phi \mid y, \text{given set is aberrant})$.

Here $w_{(r)}$ is the posterior probability that a particular set of $r$ innovations
are aberrant given the data. Inference about $\phi$ can be made using $P(\phi \mid y)$,
and $w_{(r)}$ provides information about the identity of outliers.

A similar approach can be adopted in principle for the AO case.

3.4 Lagrange Multiplier (L-M) Test 

Suppose $\beta$ is a vector of parameters in some space $\Omega$ and one is interested
in testing $k$ restrictions

    $H_0$: $h_i(\beta) = 0$, $i = 1, 2, \ldots, k$.

We consider the function

    $L = l + \sum_{j=1}^{k}\lambda_j h_j(\beta)$,

where $l$ is the log likelihood and $\lambda_j$ ($j = 1, 2, \ldots, k$) a set of Lagrange
multipliers. Now let

    $d = \partial l/\partial\beta$, $H = (\partial h_j/\partial\beta_i)$,

both evaluated at the restricted maximum likelihood estimate,
and $S = d'I^{-1}d$ where $I$ is the information matrix under the null hypothesis
$H_0$. $S$ is usually referred to as the score statistic, and it can be seen that
$S = \lambda'H'I^{-1}H\lambda$. Under $H_0$, $S$ has an asymptotic chi-square distribution
with $k$ degrees of freedom (d.f.).

We specialize by considering the familiar AR($p$) process in the possible
presence of an outlier, which may be written as follows:

    $z_t = \sum_{i=1}^{p}\phi_i z_{t-i} + \omega_1\phi(B)I_t(q) + a_t + \omega_2 I_t(q)$.

When $\phi$ and $\sigma^2$ are known it can be shown (Yatawara, 1985) that the score
statistic to test for an outlier at $t = q$ is given by

    $S_q = \Big\{e_q^2 + \Big(\sum_{i=1}^{n-p}\phi_i e_{q+i}\Big)^2\Big/\sum_{i=1}^{n-p}\phi_i^2\Big\}\Big/\sigma^2$, $\phi_i = 0$, $i > p$,

where $e_t = z_t - \sum_{i=1}^{p}\phi_i z_{t-i}$ are the recursive residuals. $S_q$ has a chi-square
distribution with 2 degrees of freedom and this may be used for tests of
significance.

In practice $q$ is unknown and it may be necessary to consider $S_t$,
$t = p+1, \ldots, n-p$. Thus we consider the statistic

    $S_0 = \max_{p+1 \le t \le n-p} S_t$

and its distribution. Here we take $p = 1$ (AR(1)). The general case is
considered by Yatawara (1985). When $p = 1$, $S_2, S_3, \ldots$ are 1 step dependent.
It can be shown that when $n$ is large $S_0$ has an extreme value distribution.
Thus one can find the significance points, $c$, from the relation

    Significance level ($\alpha$) $= P(S_0/2 - \ln(n-2) > c) = 1 - \exp(-\exp(-c))$.

Table 1. Simulated significance points.

n       5%               1%

50      13.04 (13.68)    15.92 (16.94)
100     14.76 (15.11)    18.20 (18.37)
200     16.12 (16.52)    19.22 (19.78)

(Values from the extreme value approximation are shown in parentheses.)



Accuracy of this approximation has been checked by a simulation
experiment. We generated data from the process $z_t = .4z_{t-1} + a_t$ ($\sigma^2 = 1$) and
obtained $S_0$ for sample size 50, 100 and 200. This was repeated 500 times
and the significance points obtained are shown, together with those from the
extreme value approximation, in Table 1.

There seems to be good agreement between the simulated and approxi- 
mate significance points when n = 100 or 200. For example when n = 100 
the approximate 1% significance point is 18.37 while the simulated one is 
18.20. 
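
The approximate significance points follow from solving the extreme value relation for $c$ and converting back to the scale of $S_0$. The Python sketch below does this and also computes a path of score statistics for an AR(1) model. The form $S_t = (e_t^2 + e_{t+1}^2)/\sigma^2$ used for the AR(1) case is an assumption made here, as are the function names.

```python
import math


def critical_value(n, alpha):
    """Approximate significance point for S_0 from the extreme value relation.

    Solves alpha = 1 - exp(-exp(-c)) for c, then returns 2 * (c + ln(n - 2)).
    """
    c = -math.log(-math.log(1.0 - alpha))
    return 2.0 * (c + math.log(n - 2))


def ar1_score_path(z, phi, sigma2=1.0):
    """Score statistics for an AR(1) series with known phi and sigma^2.

    Entry i of the result refers to time index i + 1 (0-based) of z.
    Assumed form: S_t = (e_t^2 + e_{t+1}^2) / sigma^2.
    """
    resid = [z[t] - phi * z[t - 1] for t in range(1, len(z))]
    return [(resid[i] ** 2 + resid[i + 1] ** 2) / sigma2
            for i in range(len(resid) - 1)]


if __name__ == "__main__":
    for n in (50, 100, 200):
        print(n, round(critical_value(n, 0.05), 2), round(critical_value(n, 0.01), 2))
```

The values produced for $n = 50$, 100 and 200 agree with the parenthesized entries of Table 1, and an observed $S_0$ would be compared with `critical_value(n, alpha)`.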

The effect of estimation of parameters on $S_0$ was studied by Abraham
and Yatawara (1986). The indication is that one can consider the estimates
of parameters and use $S_0$ as if the parameters are known.

Example: We consider the yield data reported by Abraham and Ledolter
(1983), where it is shown that this series can be adequately represented by
an AR(1) process. An observational outlier was introduced at $t = 45$. A
plot of the data is shown in Figure 2. We analyse these data assuming that
the identity of the outlier is unknown. Figure 3 shows the plot of $S_t$ versus $t$
($t = 2, 3, \ldots$). It is quite clear that $S_0 = S_{45}$ and $y_{45}$ is an outlier. It should
be noted that since the process is AR(1), $S_{44}$ and $S_{45}$ are large.

4. DISTINGUISHING BETWEEN OUTLIER TYPES 

Figure 2. Yield data with AO introduced at t = 45.

Suppose that we have established the existence of an outlier at $t = q$.
Then it is important for further analysis to classify it as either AO or AI.
From Section 3.2 for an AR($p$) process it can be seen that

    $\hat\omega_1 = \Big(e_q - \sum_{i=1}^{n-p}\phi_i e_{q+i}\Big)\Big/\Big(1 + \sum_{i=1}^{n-p}\phi_i^2\Big)$, $\phi_i = 0$, $i > p$,

and

    $\hat\omega_2 = e_q$.

We now consider the sums of squares

    $SS(O) = \hat\omega_1^2\Big(1 + \sum_{i=1}^{n-p}\phi_i^2\Big)$,
    $SS(I) = e_q^2$,

and $D_q = SS(O) - SS(I)$. If $D_q > 0$ we take the outlier as AO and if
$D_q < 0$ we consider it as AI. Now the size of $P(D_q > 0)$ depends upon
$\omega/\sigma$ ($\omega = \omega_1$ or $\omega_2$) and $\phi$. When $\omega = 0$, $P(D_q > 0) = 0.5$. When $\omega \ne 0$ the
probability calculation requires numerical integration. Here we resort to a
limited simulation experiment with 500 repetitions. The results are given in
Table 2 and these indicate that $D_q$ can differentiate between the outlier types
quite efficiently. For example in the AR(1) case ($\phi = .5$, $\sigma^2 = 1$, $\omega = 4.5$)
when the outlier is AO, $D_q$ declares it as AO in 89% of the cases. In the AR(2)
case with $\phi_1 = 1.1$, $\phi_2 = -.7$, $\sigma^2 = 1$, $\omega = 4.5$ it correctly identifies an AI
outlier in 97% of the cases.
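
The decision rule based on $D_q$ can be sketched in Python for an AR(1) model with known $\phi$. The sketch assumes that $SS(O)$ and $SS(I)$ are the sums of squares explained by the AO and AI regressors, that position $q + 1$ exists, and uses a hypothetical function name.

```python
def classify_outlier(residuals, q, phi):
    """Classify an outlier at index q as 'AO' or 'AI' using the sign of D_q.

    Assumes an AR(1) model with known phi and q + 1 < len(residuals).
    """
    e_q, e_next = residuals[q], residuals[q + 1]
    ss_o = (e_q - phi * e_next) ** 2 / (1.0 + phi * phi)   # explained by AO
    ss_i = e_q ** 2                                        # explained by AI
    d_q = ss_o - ss_i
    return ("AO" if d_q > 0 else "AI"), d_q


if __name__ == "__main__":
    print(classify_outlier([0.0, 0.0, 3.0, -1.5, 0.0], 2, 0.5))  # AO pattern
    print(classify_outlier([0.0, 0.0, 3.0, 0.0, 0.0], 2, 0.5))   # AI pattern
```

An AO leaves a trace in both $e_q$ and $e_{q+1}$, which is why the AO regressor explains more in the first example, while a lone large $e_q$ favours AI in the second.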




Figure 3. Plot of $S_t$ vs. $t$.

Table 2. Simulated $P(D_q > 0)$

Process specifications    Outlier type    $P(D_q > 0)$

5. DIAGNOSTIC CHECKS FOR OUTLIERS



There are a number of diagnostic tools available for regression models.
Some of these can be adapted for time series models. In this paper we
consider only autoregressive models. Suppose $(z_1, \ldots, z_n)$ is a set of
observations generated by an autoregressive process of order $p$ (AR($p$)). Then we
can write

    $Z = X\phi + a$,    (5.1)

where $Z' = (z_{p+1}, \ldots, z_n)$, $\phi' = (\phi_1, \ldots, \phi_p)$, $a' = (a_{p+1}, \ldots, a_n)$ and

    $X = \begin{bmatrix} z_p & z_{p-1} & \cdots & z_1 \\ \vdots & \vdots & & \vdots \\ z_{n-1} & z_{n-2} & \cdots & z_{n-p} \end{bmatrix}$.

Then the conditional least squares estimate of $\phi$ is given by $\hat\phi = (X'X)^{-1}X'Z$
and $\hat{Z} = X(X'X)^{-1}X'Z = HZ$. Most of the diagnostic tools are based on
the idea of deleting suspected observations and of building a measure of the
change introduced by the deletion into some feature of the model. In the
regression framework, deleting an observation and deleting an equation from
(5.1) are equivalent. This is not true in the time series context. For deletion
operations we consider the following partitioning:

    $X = \begin{bmatrix} X_1 \\ X_2 \\ X_3 \end{bmatrix}$, with blocks of order $(q-p) \times p$, $k \times p$ and $(n-q-k) \times p$,

    $Z = \begin{bmatrix} Z_1 \\ Z_2 \\ Z_3 \end{bmatrix}$, with blocks of order $(q-p) \times 1$, $k \times 1$ and $(n-q-k) \times 1$.



Then the residuals

    $e = Z - \hat{Z} = \begin{bmatrix} e_1 \\ e_2 \\ e_3 \end{bmatrix} = (I - H)Z = \begin{bmatrix} I - H_{11} & -H_{12} & -H_{13} \\ -H_{21} & I - H_{22} & -H_{23} \\ -H_{31} & -H_{32} & I - H_{33} \end{bmatrix}\begin{bmatrix} Z_1 \\ Z_2 \\ Z_3 \end{bmatrix}$.

We now consider the statistics (see Draper and John, 1981; Little, 1985):

(1) $Q_k(q) = e_2'(I - H_{22})^{-1}e_2$,

(2) $R_k(q) = (1 - Q_k(q)/RSS)\,|I - H_{22}|$, RSS = residual sum of squares,

(3) $C_k(q) = e_2'(I - H_{22})^{-1}H_{22}(I - H_{22})^{-1}e_2/(p\hat\sigma^2)$.

Monitoring these statistics for $q = p+1, \ldots, n-k-p+1$ and for $k = 1$
and $p+1$, one can spot outliers and specify their types (see Abraham and
Chuang, 1986). Asymptotic distributions and approximations for efficient
calculations are described in the above paper.

For illustration of the patterns, consider the $Q$ statistic for an AR(1)
process. When $k = 1$, $H_{22} = z_{q-1}^2\big/\sum_{t=2}^{n} z_{t-1}^2$, and $Q_1(q) = e_q^2\,(1 - H_{22})^{-1}$.

Note that the second factor in $Q_1(q)$ is approximately one. Thus $Q_1$
depends on $e_q$. If there is an AO outlier at $t = q$ then $e_q$ and $e_{q+1}$ are large
and hence $Q_1(q)$ and $Q_1(q+1)$ are large. If the outlier is AI then only $e_q$ is
large which leads to $Q_1(q)$ alone being large.

If $k = 2$ then $Q$ depends on $e_q$ and $e_{q+1}$. In this case, for the AO
model, $Q_2(q-1)$, $Q_2(q)$ and $Q_2(q+1)$ are affected by the outlier. However in the
AI case only $Q_2(q-1)$ and $Q_2(q)$ are affected. Thus these statistics can help in
specifying the outlier types.
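
The single-deletion statistic $Q_1(q)$ for an AR(1) fit is simple enough to compute directly. The Python sketch below is illustrative only: it assumes a no-intercept least squares fit, uses a hypothetical function name, and would fail with a division by zero if a single lagged value carried all of the leverage.

```python
def q1_path(z):
    """Return [(q, Q_1(q)), ...] for a no-intercept AR(1) least squares fit.

    q is the 0-based time index of the response z[q]; its regressor is z[q-1].
    """
    lagged, current = z[:-1], z[1:]
    sum_sq = sum(value * value for value in lagged)
    phi_hat = sum(a * b for a, b in zip(lagged, current)) / sum_sq
    path = []
    for q, (x_prev, x_now) in enumerate(zip(lagged, current), start=1):
        resid = x_now - phi_hat * x_prev
        h22 = x_prev * x_prev / sum_sq          # leverage of this equation
        path.append((q, resid * resid / (1.0 - h22)))
    return path


if __name__ == "__main__":
    series = [1.0, 1.0, 1.0, 1.0, 10.0, 1.0, 1.0, 1.0]
    for q, value in q1_path(series):
        print(q, round(value, 2))
```

In the toy series the isolated value at index 4 produces the largest $Q_1$ at $q = 4$ and the second largest at $q = 5$, the two-point pattern described above for an AO outlier.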

6. CONCLUDING REMARKS 

In this paper we have briefly outlined some major developments in the
area of outliers in time series. It was not our intention to give an exhaustive
survey. We also included some new tests based on the score statistic. Some
diagnostic tools useful in detecting outliers are also introduced.

PREDICTING DEMANDS IN A MULTI-ITEM
ENVIRONMENT

ABSTRACT 

Demand forecasting for each item in a multi-line inventory system is 
considered. Demand patterns are assumed to satisfy an integrated moving- 
average process, and exponential smoothing is employed in predicting de- 
mands. An empirical Bayes estimator for the smoothing parameter using 
pooled information from all realizations is proposed. 

1. INTRODUCTION 

All inventory models require information concerning demands for stocked
items, and control of stock replenishment is always based upon demand
forecasts. Hence, it is necessary to determine the nature of the processes
generating demands in order to predict future demands.

The forecasting problem may be formulated as follows. Let
$x_1, x_2, \ldots, x_{t-1}$ be the past demands of an item under study and let $\hat{x}_t$ be
the one-step-ahead forecast of demand $x_t$ at time $t$. Brown (1963), Kirby
(1967), and Wecker (1978) have discussed the use of exponential smoothing
in estimating future demand; that is:

    $\hat{x}_t = (1 - \theta)x_{t-1} + \theta\hat{x}_{t-1}$,    (1)

where $\theta$ is the smoothing constant and $\hat{x}_{t-1}$ is the one-step-ahead forecast of
$x_{t-1}$. Ray (1982) also examined the use of autoregressive integrated
moving-average models in demand forecasting.
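
The recursion in (1) is easy to state in code. The Python sketch below is a minimal illustration; the function name and the choice of the first demand as the starting forecast are assumptions made here, since the text does not specify an initial value.

```python
def exponential_smoothing(demands, theta, initial=None):
    """One-step-ahead forecasts from recursion (1).

    Returns a list whose entry t is the forecast made for period t + 1 of
    `demands` (0-based), followed by one forecast for the next unseen period.
    The starting forecast defaults to the first demand (an assumption).
    """
    forecast = demands[0] if initial is None else initial
    forecasts = [forecast]
    for demand in demands:
        forecast = (1.0 - theta) * demand + theta * forecast
        forecasts.append(forecast)
    return forecasts


if __name__ == "__main__":
    print(exponential_smoothing([10.0, 20.0], 0.5))
```

With $\theta$ near 1 the forecast changes slowly and leans on its own past, while with $\theta = 0$ it simply repeats the latest demand.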

Most inventory systems carry a large number of items for which
individual demand forecasts are required. These items can be classified into a
number of categories by item and customer characteristics, with items in the
different categories treated differently. Steece and Wood (1979) have
classified items into meaningful and predictable aggregates. Their forecasting
method is based on the aggregate demand series and the fractional series
obtained by dividing each item demand series by the aggregate demand series.

In some cases, items can be classified into categories according to their 
stochastic behaviour. For example, semiconductor products with similar 
demand patterns are grouped in the same category. Demand patterns esti- 
mated from individual series may then be improved using pooled information 
from all realizations in the same category. Thisted and Wecker (1981) have 
examined the shrinkage estimator in exponential smoothing. Their method 
is based on the James-Stein (1961) estimator which assumes equal weights 
in the loss function. 

In the sequel we assume that the demand patterns satisfy an integrated 
moving-average process with different moving-average parameters. In the 
case where the number of past observations is small, the smoothing con- 
stant estimated from individual series generally will be inaccurate. Hence, 
an empirical Bayes estimator is introduced, and the shrinkage estimator is 
reviewed. Finally, an application is discussed, and simulation results on the 
performance of various estimators are examined. 

2. THE MODEL 

Let $\{x_{i1}, x_{i2}, \ldots, x_{iT}\}$ be past demands for item $i$ in a category of size
$n$. We assume the demands satisfy the model

    $x_{it} - x_{i,t-1} = a_{it} - \theta_i a_{i,t-1}$, $i = 1, 2, \ldots, n$, $t = 1, 2, \ldots, T$,    (2)

where $\theta_i$ is the moving-average parameter for item $i$ and the $a_{it}$'s are
independent normally distributed random errors with mean zero and variance
$\sigma_a^2$. Here $\{x_{it}\}$ satisfy an integrated moving-average (IMA) process of
order (0, 1, 1). Note that the exponential smoothing predictor in (1) gives the
minimum mean squared error forecast for model (2) and is widely used in
demand forecasting. The problem is to estimate the smoothing constants $\theta_i$
which characterize the demand patterns.

Suppose the $\theta_i$'s are sampled from an unknown prior distribution $G(\theta)$.
Let $\hat\theta_i$ be an asymptotically efficient estimate of $\theta_i$ based on the individual
series $\{x_{it}\}$ for each item $i$, $i = 1, 2, \ldots, n$ (for reference, see Box and Jenkins,
1976). We wish to improve the estimate $\hat\theta_i$ in terms of squared error loss
with additional information from the other $\hat\theta$'s.

3. THE SHRINKAGE ESTIMATOR

Thisted and Wecker (1981) assumed equal variances for $\hat\theta_i$'s and
presented the following shrinkage estimator:

    $\tilde\theta_i = \Big\{1 - (n-3)s^2\Big/\sum_{j=1}^{n}(\hat\theta_j - \bar\theta)^2\Big\}^{+}(\hat\theta_i - \bar\theta) + \bar\theta$, $i = 1, 2, \ldots, n$,    (3)

where (i) $\bar\theta = \sum_{i=1}^{n}\hat\theta_i/n$, (ii) $s^2$ is formed from the $s_{\hat\theta_i}$, with $s_{\hat\theta_i}$ an estimated
standard error of $\hat\theta_i$, and (iii) $\{\cdot\}^{+}$ is the positive-part rule. Efron and Morris
(1973, 1975) and Morris (1983) have discussed the shrinkage estimator in a
general context.

Note that the assumption of equal variances for $\hat\theta_i$'s is not always valid.

4. THE EMPIRICAL BAYES ESTIMATOR 

Li and Hui (1983) have discussed an empirical Bayes approach in esti- 
mating random coefficient autoregressive processes. We employ the same 
technique here for the IMA(0, 1, 1) model in (2). 

Observe that $\hat\theta_i$ given $\theta_i$ has an asymptotic normal distribution; that is,

    $f(\hat\theta_i \mid \theta_i) = (2\pi\alpha_i)^{-1/2}\exp\{-(\hat\theta_i - \theta_i)^2/(2\alpha_i)\}$,    (4)

where $\alpha_i = (1 - \theta_i^2)/T$. If one differentiates with respect to $\hat\theta_i$ one obtains:

    $\partial\log f(\hat\theta_i \mid \theta_i)/\partial\hat\theta_i = -(\hat\theta_i - \theta_i)/\alpha_i$.    (5)

If $\hat\alpha_i$ is an estimate of $\alpha_i$, then equation (5) can be written as follows:

    $\theta_i = \hat\theta_i + \hat\alpha_i\{\partial\log f(\hat\theta_i \mid \theta_i)/\partial\hat\theta_i\}$.    (6)

Application of the law of conditional probability yields an approximate
Bayes estimator for $\theta_i$ with a squared error loss given by:

    $\theta_i^* = E(\theta_i \mid \hat\theta_i) = \hat\theta_i + \hat\alpha_i\{f_G'(\hat\theta_i)/f_G(\hat\theta_i)\}$,    (7)

where $f_G$ is the marginal density of $\hat\theta_i$ and $f_G'$ is its derivative.

The marginal density $f_G$ is unknown since the prior distribution is not
given. Let $f_n'(\hat\theta_i)/f_n(\hat\theta_i)$ be an empirical estimate of $f_G'(\hat\theta_i)/f_G(\hat\theta_i)$ from
$\hat\theta_1, \hat\theta_2, \ldots, \hat\theta_n$ (for reference, see Parzen, 1962). If $\hat\alpha_i = (1 - \hat\theta_i^2)/T$ is an
estimate of $\alpha_i$, then the empirical Bayes estimate of $\theta_i$ is given as follows:

    $\theta_i^* = \hat\theta_i + T^{-1}(1 - \hat\theta_i^2)\{f_n'(\hat\theta_i)/f_n(\hat\theta_i)\}$, $i = 1, 2, \ldots, n$.    (8)

The performance of the empirical Bayes estimator was studied by
simulation and the results are presented in Section 6. Some theoretical results
for a linear regression model were given by Singh (1985).

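Next, we consider the special case where the prior distribution $G(\theta)$
is known to be normally distributed with mean $\bar\theta$ and variance $\sigma^2$. Then
$\hat\theta_i \mid \theta_i \sim N(\theta_i, \alpha_i)$ and $\theta_i \sim N(\bar\theta, \sigma^2)$. The marginal density of $\hat\theta_i$ is given by

    $f(\hat\theta_i) = \{2\pi(\alpha_i + \sigma^2)\}^{-1/2}\exp\{-(\hat\theta_i - \bar\theta)^2/(2(\alpha_i + \sigma^2))\}$.    (9)

That is, $\hat\theta_i \sim N(\bar\theta, \alpha_i + \sigma^2)$. The conditional density of $\theta_i$ given $\hat\theta_i$ is then

    $f(\theta_i \mid \hat\theta_i) = f(\hat\theta_i \mid \theta_i)f(\theta_i)/f(\hat\theta_i) = (2\pi s)^{-1/2}\exp\{-(\theta_i - u)^2/(2s)\}$,    (10)

where $u = (\sigma^2\hat\theta_i + \alpha_i\bar\theta)/(\sigma^2 + \alpha_i)$ and $s = \sigma^2\alpha_i/(\sigma^2 + \alpha_i)$, or $\theta_i \mid \hat\theta_i \sim N(u, s)$.
For squared error loss, the Bayes estimator is

    $\theta_i^B = (\sigma^2\hat\theta_i + \alpha_i\bar\theta)/(\sigma^2 + \alpha_i)$.

An empirical Bayes estimator for $\theta_i$ is then given by

    $\theta_i^* = (\hat\sigma^2\hat\theta_i + \hat\alpha_i\bar\theta)/(\hat\sigma^2 + \hat\alpha_i)$,    (11)

where $\sigma^2$ is estimated by $\hat\sigma^2$. This result can also be obtained from

    $f'(\hat\theta_i)/f(\hat\theta_i) = -(\hat\theta_i - \bar\theta)/(\alpha_i + \sigma^2)$    (12)

and equation (8).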
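
The equivalence between the weighted-average form (11) and the score form obtained from (8) and (12) can be checked numerically. The Python sketch below is illustrative only: the function names are hypothetical and the prior mean and variance are treated as given inputs rather than estimated.

```python
def bayes_normal_prior(theta_hat, T, prior_mean, prior_var):
    """Weighted-average form of the estimator under a normal prior."""
    alpha = (1.0 - theta_hat ** 2) / T      # estimated sampling variance
    return (prior_var * theta_hat + alpha * prior_mean) / (prior_var + alpha)


def bayes_via_score(theta_hat, T, prior_mean, prior_var):
    """Same estimator written as theta_hat + alpha * f'/f, using (8) and (12)."""
    alpha = (1.0 - theta_hat ** 2) / T
    score = -(theta_hat - prior_mean) / (alpha + prior_var)
    return theta_hat + alpha * score


if __name__ == "__main__":
    print(bayes_normal_prior(0.6, 20, 0.8, 0.032))
    print(bayes_via_score(0.6, 20, 0.8, 0.032))
```

In the example the sampling variance $(1 - 0.36)/20 = 0.032$ equals the prior variance, so both forms place the estimate halfway between the individual estimate 0.6 and the prior mean 0.8.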







5. EXAMPLE 

Fifteen semiconductor products with similar stochastic behaviours in a
manufacturing plant are grouped in a single category. The grouping is
dependent on component characteristics and market demand. Demands $\{x_{it}\}$
for each item in the past twenty weeks were obtained from the accounting
department and the exponential smoothing predictor,

Xit = (1 - , t = 1, 2, . . . , 20, (13) 

is to be employed in demand forecasting. Precise estimates for the smoothing
constants are required.

The exact likelihood estimator is chosen as the estimator of the moving-
average parameter; an algorithm was given by Ansley (1979). The exact
likelihood estimates $\hat\theta_i$ computed for the individual series $\{x_{it}\}$ are given in
Table 1.



Table 1. Exponential Smoothing Parameter Estimates
for Fifteen Semiconductor Products.

Item    $\hat\theta_i$    $\theta_i^*$    $\tilde\theta_i$
  1      .875      .855      .872
  2      .800      .794      .801
  3      .593      .729      .604
  4      .926      .907      .920
  5      .954      .939      .946
  6      .739      .759      .742
  7      .738      .758      .741
  8      .957      .944      .950
  9      .906      .885      .901
 10      .920      .900      .914
 11      .549      .735      .562
 12      .798      .792      .799
 13      .956      .942      .949
 14      .677      .736      .684
 15      .692      .740      .699

We assume that the moving-average parameters were sampled from an
unknown prior distribution $G(\theta)$. Following the procedure discussed above,
we first estimate the marginal density $f_G(\hat\theta)$ using a kernel density estimate
discussed by Parzen (1962) and Clemmer and Krutchkoff (1968). Let

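$$\frac{f_n'(\theta)}{f_n(\theta)} = \frac{\sum_{i=1}^{n} \{((\sin A_i)/A_i)^2 - ((\sin B_i)/B_i)^2\}}{h \sum_{i=1}^{n} ((\sin B_i)/B_i)^2}, \tag{14}$$

where $\sin 0/0 = 1$, $A_i = (\theta - \hat\theta_i + h)/(2h)$, and $B_i = (\theta - \hat\theta_i)/(2h)$, with $h =$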

0 _ Y^i-i Oi/n. The empirical Bayes estimates, 

are computed using equation (8) and are presented in Table 1. For
reference, the shrinkage estimates, $\tilde\theta_i$, are also shown in Table 1, where the
variance of $\hat\theta_i$ is estimated by $\hat\alpha_i = (1 - \hat\theta_i^2)/T$.
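
The computation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' program: the function names are hypothetical, the ratio is the forward-difference form of equation (14), and the bandwidth `h = 0.1` is an assumed value chosen only so that the example runs (the bandwidth actually used for Table 1 is not recoverable here).

```python
import math


def sinc_sq(x):
    """Return (sin x / x)^2 with the convention sin 0 / 0 = 1."""
    if x == 0.0:
        return 1.0
    return (math.sin(x) / x) ** 2


def score_ratio(theta, estimates, h):
    """Kernel estimate of f'(theta) / f(theta), following equation (14)."""
    num = 0.0
    den = 0.0
    for est in estimates:
        a = (theta - est + h) / (2.0 * h)  # A_i
        b = (theta - est) / (2.0 * h)      # B_i
        num += sinc_sq(a) - sinc_sq(b)
        den += sinc_sq(b)
    return num / (h * den)


def empirical_bayes(estimates, T, h):
    """Apply equation (8) to each exact likelihood estimate."""
    return [
        est + (1.0 - est ** 2) / T * score_ratio(est, estimates, h)
        for est in estimates
    ]


# Exact likelihood estimates from Table 1; T = 20 weeks of data per item.
theta_hat = [.875, .800, .593, .926, .954, .739, .738, .957,
             .906, .920, .549, .798, .956, .677, .692]
theta_star = empirical_bayes(theta_hat, T=20, h=0.1)  # h is an assumption
```

Because the result depends on the bandwidth, the values returned by this sketch should not be expected to reproduce the empirical Bayes column of Table 1.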

Observe that both the empirical Bayes approach and the shrinkage
method adjust the exact likelihood estimates towards the centre of the
marginal distribution $f_G(\hat\theta)$. The improvements using empirical Bayes
methods were examined in a simulation study, which is presented below.

6. SIMULATION STUDY 

We first sampled moving-average parameters $\{\theta_1, \ldots, \theta_n\}$ from a
known distribution $G(\theta)$. Brown (1963) noted that smoothing constants
usually have a mean close to 0.8. We focused on beta distributions with
means equal to 0.8 for $G(\theta)$. Starting values for $i = 1, 2, \ldots, n$ were
generated independently from a lognormal distribution with parameters $\mu = 3$
and a = 2. The error variances were chosen as a? = 2 + = 1, 2, . . . , n, 

with high demand items assigned larger variances. Demand data were 
then generated from model (2). 

The exact likelihood estimate $\hat\theta_i$ was obtained for each individual series.
The shrinkage estimate $\tilde\theta_n$ and the empirical Bayes estimate $\theta_n^*$ for the nth
item were then computed.

A set of simulation results is shown in Table 2 with n = 15, T = 20, a
prior distribution of beta (8,2), and 10 replications. It can be seen that, in
most replications, both the empirical Bayes estimates and the shrinkage
estimates are closer to $\theta_n$ than the exact likelihood estimates. In some cases
improvements through the empirical Bayes approach are very encouraging.

Table 2. Simulation Results on the Estimation of the Smoothing
Parameter $\theta_n$ in 10 Replications.

Rep.    $\theta_n$    $\hat\theta_n$    $\theta_n^*$    $\tilde\theta_n$
  1      .830      .940      .930      .934
  2      .540      .604      .671      .610
  3      .658      .744      .744      .744
  4      .823      .887      .855      .880
  5      .738      .911      .897      .909
  6      .918      .807      .836      .808
  7      .843      .636      .684      .643
  8      .898      .708      .756      .713
  9      .879      .933      .926      .931
 10      .831      .914      .904      .912

We fixed n = 15 and T = 20, and performed 1000 replications with
various beta prior distributions. We let

$$R1(\theta^* \mid \hat\theta) = \frac{\sum (\theta_n^* - \theta_n)^2}{\sum (\hat\theta_n - \theta_n)^2} \tag{15}$$

and

$$R2(\theta^* \mid \hat\theta) = \frac{\sum |\theta_n^* - \theta_n|}{\sum |\hat\theta_n - \theta_n|} \tag{16}$$

denote the relative mean squared errors and the relative mean absolute
deviations respectively, where the summations are taken over 1000
replications. A value of less than 1 in R1 or R2 indicates that $\theta^*$
outperformed $\hat\theta$ in estimating $\theta$. We also denote by $C(\theta^* \mid \hat\theta)$ the total count
that $|\theta_n^* - \theta_n| < |\hat\theta_n - \theta_n|$ in 1000 replications. With a similar notation,
we define $R1(\tilde\theta \mid \hat\theta)$, $R1(\theta^* \mid \tilde\theta)$, $R2(\tilde\theta \mid \hat\theta)$, $R2(\theta^* \mid \tilde\theta)$, $C(\tilde\theta \mid \hat\theta)$ and $C(\theta^* \mid \tilde\theta)$.
Simulation results are presented in Table 3. They show that both the
empirical Bayes estimator and the shrinkage estimator are superior to the exact
likelihood estimator. The empirical Bayes estimator also outperforms the
shrinkage estimator, and the improvement is significant.
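
The three criteria are simple to compute once the true parameter and the competing estimates are stored per replication. The sketch below uses a hypothetical helper, `relative_criteria`, and applies it to the ten replications listed in Table 2 purely as an illustration; with so few replications the ratios are noisy and are not meant to match the 1000-replication figures in Table 3.

```python
def relative_criteria(truth, candidate, baseline):
    """Return (R1, R2, C) comparing `candidate` with `baseline`.

    R1: ratio of summed squared errors, as in equation (15).
    R2: ratio of summed absolute errors, as in equation (16).
    C : number of replications where the candidate is strictly closer.
    """
    cand_err = [c - t for c, t in zip(candidate, truth)]
    base_err = [b - t for b, t in zip(baseline, truth)]
    r1 = sum(e * e for e in cand_err) / sum(e * e for e in base_err)
    r2 = sum(abs(e) for e in cand_err) / sum(abs(e) for e in base_err)
    count = sum(1 for c, b in zip(cand_err, base_err) if abs(c) < abs(b))
    return r1, r2, count


# The ten replications of Table 2.
theta_true = [.830, .540, .658, .823, .738, .918, .843, .898, .879, .831]
theta_hat = [.940, .604, .744, .887, .911, .807, .636, .708, .933, .914]
theta_star = [.930, .671, .744, .855, .897, .836, .684, .756, .926, .904]
theta_tilde = [.934, .610, .744, .880, .909, .808, .643, .713, .931, .912]

# Empirical Bayes versus exact likelihood, then shrinkage versus exact likelihood.
r1_eb, r2_eb, c_eb = relative_criteria(theta_true, theta_star, theta_hat)
r1_sh, r2_sh, c_sh = relative_criteria(theta_true, theta_tilde, theta_hat)
```

Ties (such as replication 3, where all three estimates coincide) are not counted in C because the comparison is strict.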

Similar simulation results were obtained for other prior distributions.
Furthermore, other simulation results show that the more items in a
category, the better the performance in pooling of information; that is, more
information about the prior distribution is available. The empirical Bayes
approach often gives significant improvement for short series.

Table 3. Simulation Results on the Estimation of the Smoothing
Parameter $\theta_n$ in 1000 Replications.

Prior          Criteria    $\theta^* \mid \hat\theta$    $\tilde\theta \mid \hat\theta$    $\theta^* \mid \tilde\theta$
beta (4,1)     R1          .747        .966        .773
               R2          .895        .985        .909
               C           589         640         579
beta (8,2)     R1          .680        .956        .712
               R2          .839        .978        .858
               C           692         734         669
beta (20,5)    R1          .545        .940        .580
               R2          .752        .966        .778
               C           838         869         821



7. CONCLUSION 

In conclusion, it can be noted that the empirical Bayes method can
be extended to general autoregressive integrated moving-average demand
patterns.

ON THE EFFICIENCY OF A STRONGLY CONSISTENT
ESTIMATOR IN ARMA MODELS

ABSTRACT 

Hannan (1975) showed that the initial estimates of the autoregressive 
parameters in an ARMA(p, q) model which were suggested by Box and Jenk- 
ins (1976, p. 499) are strongly consistent. In this note, the efficiency of this 
estimate in the ARMA(1,1) model is examined. 

1. INTRODUCTION 

Given the time series $\{Z_t\}$, $t = 1, 2, \ldots, n$, the mixed autoregressive-
moving average model of order p and q respectively, ARMA(p, q), is defined
to be

$$\phi(B)Z_t = \theta(B)a_t, \tag{1}$$

where $\phi(B) = 1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_p B^p$ and
$\theta(B) = 1 - \theta_1 B - \theta_2 B^2 - \cdots - \theta_q B^q$ are polynomials in B; B is the backshift operator such that
$BZ_t = Z_{t-1}$; and $\{a_t\}$ is a Gaussian white noise with zero mean and variance
$\sigma_a^2$. The characteristic roots of $\phi(B) = 0$ and $\theta(B) = 0$ are assumed to lie
outside the unit circle and it is further assumed that there are no common
roots. Without loss of generality, it is further assumed that the time series
has zero mean, i.e., $\langle Z_t \rangle = 0$, where $\langle \cdot \rangle$ denotes mathematical expectation.

Let $\phi = (\phi_1, \ldots, \phi_p)'$ and $\theta = (\theta_1, \ldots, \theta_q)'$. For a pure autoregression of
order p, denoted by AR(p), the vector of parameters $\phi$ can be determined
by solving the Yule-Walker equation

$$P_p \phi = \rho_p, \tag{2}$$

where

$$P_p = \begin{bmatrix} 1 & \rho_1 & \cdots & \rho_{p-1} \\ \rho_1 & 1 & \cdots & \rho_{p-2} \\ \vdots & \vdots & & \vdots \\ \rho_{p-1} & \rho_{p-2} & \cdots & 1 \end{bmatrix}, \qquad \rho_p = (\rho_1, \ldots, \rho_p)',$$

and

$$\rho_k = \frac{\langle Z_t Z_{t-k} \rangle}{\langle Z_t^2 \rangle}, \quad k = 1, 2, \ldots$$

Then the Yule-Walker estimate of $\phi$ can be obtained from (2) by simply
replacing $\rho_k$ by its estimate $r_k$, the sample autocorrelation function defined
by

$$r_k = \frac{c_k}{c_0},$$

where $c_k = n^{-1} \sum_{t=1}^{n-k} Z_t Z_{t+k}$.

Hannan (1975) showed that in the ARMA(p, q) model the autoregressive
parameters, $\phi_1, \ldots, \phi_p$, estimated by solving

$$\sum_{i=1}^{p} \phi_i r_{j-i} = r_j, \quad j = q+1, \ldots, q+p, \tag{3}$$

are strongly consistent. In the next section, the asymptotic efficiency of
$\tilde\phi_1$ in the ARMA(1,1) model is derived.



2. EFFICIENCY OF $\tilde\phi_1$

The ARMA(p, q) model in (1) can be written as

$$Z_t - \phi_1 Z_{t-1} - \cdots - \phi_p Z_{t-p} = a_t - \theta_1 a_{t-1} - \cdots - \theta_q a_{t-q}.$$

If one multiplies by $Z_{t-k}$ and takes expected values, one obtains the
following:

$$\gamma_k - \phi_1\gamma_{k-1} - \cdots - \phi_p\gamma_{k-p} = \gamma_{za}(k) - \theta_1\gamma_{za}(k-1) - \cdots - \theta_q\gamma_{za}(k-q), \tag{4}$$

where $\gamma_k$ is the covariance function between the series $Z_t$ and $Z_{t-k}$ defined
by $\gamma_k = \langle Z_t Z_{t-k} \rangle$, and $\gamma_{za}(k)$ is the cross covariance function between $Z_t$
and $a_t$, and is defined by $\gamma_{za}(k) = \langle Z_{t-k} a_t \rangle$.

Upon dividing by $\gamma_0$, equation (4) can be written,

$$\rho_k - \phi_1\rho_{k-1} - \cdots - \phi_p\rho_{k-p} = 0, \quad k \ge q + 1.$$

The estimate $\tilde\phi$ is obtained by solving this equation and by replacing $\rho_k$ by
its sample estimate $r_k$.

For the ARMA(1,1) model,

$$\tilde\phi_1 = \frac{r_2}{r_1} = \frac{c_2}{c_1}.$$
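
For the ARMA(1,1) case the estimate is just a ratio of two sample autocovariances, which the following sketch computes. The function names and the simulation settings ($\phi_1 = 0.7$, $\theta_1 = 0.3$, n = 20000, unit-variance Gaussian noise, a burn-in of 200 values) are assumptions made for illustration only.

```python
import random


def autocovariance(z, k):
    """c_k = n^{-1} * sum_{t=1}^{n-k} z_t z_{t+k} for a zero-mean series."""
    n = len(z)
    return sum(z[t] * z[t + k] for t in range(n - k)) / n


def initial_ar_estimate(z):
    """ARMA(1,1) initial estimate of phi_1: c_2 / c_1 (equal to r_2 / r_1)."""
    return autocovariance(z, 2) / autocovariance(z, 1)


def simulate_arma11(phi, theta, n, seed=0, burn_in=200):
    """Simulate Z_t - phi * Z_{t-1} = a_t - theta * a_{t-1}, a_t ~ N(0, 1)."""
    rng = random.Random(seed)
    z_prev, a_prev = 0.0, 0.0
    series = []
    for t in range(n + burn_in):
        a = rng.gauss(0.0, 1.0)
        z = phi * z_prev + a - theta * a_prev
        if t >= burn_in:  # discard start-up values
            series.append(z)
        z_prev, a_prev = z, a
    return series


series = simulate_arma11(phi=0.7, theta=0.3, n=20000, seed=1)
phi_tilde = initial_ar_estimate(series)
```

The estimate becomes unstable when $c_1$ is close to zero, which is one practical symptom of the poor efficiency discussed later in this section.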



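Expanding in Taylor series up to first order terms yields:

$$\tilde\phi_1 = \phi_1 + (c_2 - \gamma_2)\frac{1}{\gamma_1} - (c_1 - \gamma_1)\frac{\phi_1}{\gamma_1}.$$

As indicated by Lomnicki and Zaremba (1957), this expansion ignores terms
$O(1/n)$. Then

$$V(\tilde\phi_1) = \frac{1}{\gamma_1^2}V(c_2) + \frac{\phi_1^2}{\gamma_1^2}V(c_1) - \frac{2\phi_1}{\gamma_1^2}\,\mathrm{cov}(c_1, c_2). \tag{5}$$

It is easily shown that, apart from terms $O(1/n^2)$,

$$V(c_1) = \frac{1}{n^2}\sum_{t=1}^{n-1}\sum_{s=1}^{n-1}\{\gamma_{t-s}^2 + \gamma_{t-s-1}\gamma_{t+1-s}\}
= \frac{1}{n}\left\{\gamma_0^2 + 2\phi_1\gamma_0\gamma_1 + \gamma_1^2 + \frac{2\gamma_1^2(1 + \phi_1^2)}{1 - \phi_1^2}\right\}, \tag{6}$$

$$V(c_2) = \frac{1}{n^2}\sum_{t=1}^{n-2}\sum_{s=1}^{n-2}\{\gamma_{t-s}^2 + \gamma_{t-s-2}\gamma_{t+2-s}\}
= \frac{1}{n}\left\{\gamma_0^2 + 3\phi_1^2\gamma_1^2 + 2\phi_1^3\gamma_0\gamma_1 + \frac{2\gamma_1^2(1 + \phi_1^4)}{1 - \phi_1^2}\right\}, \tag{7}$$

and

$$\mathrm{cov}(c_1, c_2) = \frac{1}{n^2}\sum_{t=1}^{n-1}\sum_{s=1}^{n-2}\{\gamma_{t-s}\gamma_{t-s-1} + \gamma_{t-s-2}\gamma_{t+1-s}\}
= \frac{2}{n}\left\{\phi_1\gamma_1^2 + \gamma_1(1 + \phi_1^2)\left(\gamma_0 + \frac{\phi_1\gamma_1}{1 - \phi_1^2}\right)\right\}. \tag{8}$$

If one substitutes equations (6), (7) and (8) into (5) one may obtain the
following:

$$V(\tilde\phi_1) = \frac{1}{n}\left\{2 - \frac{4\phi_1}{\rho_1} + \frac{1 + \phi_1^2}{\rho_1^2}\right\},$$

where

$$\rho_1 = \frac{(1 - \phi_1\theta_1)(\phi_1 - \theta_1)}{1 + \theta_1^2 - 2\phi_1\theta_1}.$$

Let $\hat\phi_1$ denote the maximum likelihood estimate of $\phi_1$. Box and Jenkins
(1976, p. 242) give the asymptotic variance of $\hat\phi_1$ as follows:

$$V(\hat\phi_1) = \frac{(1 - \phi_1\theta_1)^2(1 - \phi_1^2)}{n(\phi_1 - \theta_1)^2}.$$

Hence the asymptotic efficiency of $\tilde\phi_1$ relative to $\hat\phi_1$ is

$$\mathrm{Eff} = \frac{(1 - \phi_1\theta_1)^2(1 - \phi_1^2)\,\rho_1^2}{(\phi_1 - \theta_1)^2(2\rho_1^2 - 4\rho_1\phi_1 + \phi_1^2 + 1)}.$$

It follows that,

$$\lim_{\theta_1 \to 0^+} \mathrm{Eff} = 1 \quad \text{and} \quad \lim_{\theta_1 \to 0^-} \mathrm{Eff} = 1.$$

Hence $\tilde\phi_1$ is as efficient as $\hat\phi_1$ when $\theta_1$ is close to zero.

Remark: The result of Section 2 also applies to the estimator of $\theta_1$ in an
ARMA(1,1) model given by $\tilde\theta_1 = ri(2)/ri(1)$, where $ri(\cdot)$ denotes the inverse
autocorrelation function.

The asymptotic efficiency for various models is presented in Table 1.
This table shows that $\tilde\phi_1$ is indeed as efficient as the maximum likelihood
estimator when $\theta_1$ is near zero, but that the efficiency is very poor when
both $\phi_1$ and $\theta_1$ are close to negative one or when both $\phi_1$ and $\theta_1$ are close
to positive one.

Table 1. Asymptotic Efficiency of $\tilde\phi_1$ Relative to $\hat\phi_1$

                                $\theta_1$
$\phi_1$     -0.90   -0.60   -0.30    0.00    0.30    0.60    0.90
-0.95    0.003   0.288   0.908   1.000   0.983   0.961   0.944
-0.75    0.012   0.174   0.739   1.000   0.919   0.815   0.736
-0.50    0.045   0.215   0.688   1.000   0.846   0.652   0.516
-0.25    0.107   0.290   0.696   1.000   0.783   0.510   0.340
 0.25    0.340   0.510   0.783   1.000   0.696   0.290   0.107
 0.50    0.516   0.652   0.846   1.000   0.688   0.215   0.045
 0.75    0.736   0.815   0.919   1.000   0.739   0.174   0.012
 0.95    0.944   0.961   0.983   1.000   0.908   0.288   0.003

RECENT RESULTS FOR TIME SERIES
IN M DIMENSIONS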

ABSTRACT 

Recent results for time series in m dimensions are reviewed and a brief
summary is given of previous fundamental theory. Many important fields
that have so far been unexplored are outlined briefly.



1. INTRODUCTION 

In this paper, established results for time series in m dimensions are
reviewed and areas needing further research are considered. Other systems of
spatial-time formulations are compared to those of time series in m
dimensions. Excellent examples of time series in m dimensions are concerned with
the characteristics of rivers, lakes, and oceans. Other examples include: the
gulf stream; atmospheric characteristics such as the jet stream; pollution of
streams; social problems; geographical problems; as well as others in science,
industry, economics and business.



2. TWO INTERESTING LONG-MEMORY PROBLEMS 

Example 1. N. G. Pisias and T. C. Moore, Jr. (1981) stated: “In the latter
part of the Pleistocene, variations in global ice volume have been dominated
by an approximate 100,000-year cycle. Analysis of a 2-Myr-long oxygen
isotope record ($_8$O$^{18}$/$_8$O$^{16}$) from an equatorial Pacific core indicates this is
only true for the last 900,000 years.” They state also that, besides this cycle,
there are two shorter cycles with periods of 41,000 and 21,000 years. It can
be shown that these last two periods are associated with the 41,000-year
and 21,000-year periodic changes in the tilt of the Earth’s axis and
in the precession of the equinoxes, respectively, as predicted by the astronomical theory
of the ice ages. The eccentricity of the Earth’s orbit varies with periods
of 413,000 and 100,000 years. The 100,000-year component in the spectrum
of global ice-volume changes is not predicted by simple linear forcing of the
Earth’s orbital variations; thus the origin of this phenomenon is still being
investigated. Presently many new cores have been obtained in the north
polar and south polar regions. This allows a spatial-time series approach.

¹ Institute of Administration and Management, Union College, Schenectady,
New York 12308

Example 2. Another interesting example is concerned with the ages of
bristlecone pines (Pinus longaeva) in Arizona, Utah, and California, the
oldest living trees, at 4,500 years. A bristlecone pine in East Central Nevada,
cut down in 1964, exceeded 5,000 years in age, while trees now dead had
lives exceeding 9,000 years (Hitch, 1982). Their ages, found by counting
tree rings, vary according to their locations, thus providing an interesting
example to model by spatial time series. The ages of these trees have been
used to calibrate carbon dating.



3. GENERAL REMARKS 

The first articles discussing stationary spatial time processes (STM)
involved moving average (MA) processes, autoregressive (AR) processes and
their combination, ARMA processes. Ordinary time series have dimension
m = 0 since they deal with the time parameter only. Their analysis is
important either as time increases (into the future) or as it decreases, thus
delving into the past. STM processes are more complicated because, while t
has two directions, each spatial dimension may change also in two ways; for
m = 1, this means 4 possibilities, although some are essentially the same.
For m = 2 this means 8 possibilities. Models such as



Zt — at — 9iat-i — 



make very little sense but spatial time models, such as the following, do: 



Zx,t — ” 92Clx-l,t-li 



or 






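A simple moving average model, such as the following,

$$z_{x,t} = a_{x,t} - \theta_1 a_{x,t-1} - \theta_2 a_{x-1,t-1}, \quad -\infty < t < \infty,$$

may be written with backward shift operators, as follows:

$$z_{x,t} = \{1 - \theta_1 B_t - \theta_2 B_x B_t\}\, a_{x,t} = \theta(B_x, B_t)\, a_{x,t}, \tag{3.1}$$

where $\theta(B_x, B_t)$ represents the generating function. Then, if $|\theta_1| + |\theta_2| < 1$,
it can be shown that

$$\sigma_z^2 = \sigma_a^2(1 + \theta_1^2 + \theta_2^2),$$

$$\gamma_{01} = -\theta_1\sigma_a^2,$$

$$\gamma_{10} = \theta_1\theta_2\sigma_a^2,$$

and

$$\gamma_{11} = -\theta_2\sigma_a^2,$$

where $\gamma_{rs}$ is the covariance function. In all models considered here it is
assumed for convenience that $E(z_{x,t}) = 0$. Also, it can be noted that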



02 = {-Pii -4Ui + +P?i)]-^ 

Thus, the requirement that 



1 - 4(/>oi +Pii) > 0. 



proves the restriction 

Poi +Pu < 1/4 

and 

-Pio < Pii + Poi < Pio lor 0 < Pio < 1, 
or 

Pio < Pii + Poi < -Pio lor - 1< Pio < 0. 

The only possible values of ^oi> ^oi, are those in the intersection of the 
above restrictions. 

The coefficients of the corresponding AR model are found by inversion of 
= {^~^{BxBt)}zx,t- Estimation, simulation, confidence limits for fore- 
casting 0i, and the power spectrum are given by Aroian (1985) and Voss et 
al (1980). 

The autocovariance generating function for the process 



or 



$$z_{x,t} = (1 - \theta_1 F_t - \theta_2 F_x F_t)\, e_{x,t},$$

is

$$\Gamma(B_x, B_t) = \sigma_a^2\, \theta(B_x, B_t)\, \theta(F_x, F_t).$$

This is the backward model for time and space with the same values of $\theta_1$
and $\theta_2$ as given by (3.1).

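Now, consider a backward-forward model in x:

$$z_{x,t} = a_{x,t} - \theta_1 a_{x+1,t-1} - \theta_2 a_{x-1,t-1},$$

with

$$\gamma_{l,k} = E(z_{x,t}\, z_{x-l,t-k})$$

representing the autocovariance between $z_{x,t}$ and $z_{x-l,t-k}$. The
autocovariance generating function $\Gamma(B_x, B_t)$ of such an MA process equals
$\sigma_a^2\theta(B_x, B_t)\theta(F_x, F_t)$, and $\gamma_{l,k}$ is the coefficient of both $B_x^l B_t^k$ and $F_x^l F_t^k$.
This result holds also for models where the characteristic function is
$\theta(F_x, B_t)$ with autocovariance generating function given by
$\sigma_a^2\theta(F_x, B_t)\theta(B_x, F_t)$; $\gamma_{l,k}$ is the coefficient of both $F_x^l B_t^k$ and $B_x^l F_t^k$.

Hence

$$\Gamma(B_x, B_t) = \sigma_a^2\{1 - \theta_1 F_x B_t - \theta_2 B_x B_t\}\{1 - \theta_1 B_x F_t - \theta_2 F_x F_t\},$$

$$\sigma_z^2 = \sigma_a^2(1 + \theta_1^2 + \theta_2^2),$$

$$\gamma_{20} = \theta_1\theta_2\sigma_a^2, \quad \gamma_{11} = -\theta_2\sigma_a^2,$$

and

$$\gamma_{1,-1} = -\theta_1\sigma_a^2.$$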

For details regarding the ARMA model see the preceding references.

Multivariate models have been proposed by Aroian (1985), but results
are quite limited; this is a wonderful area for future research. The same
is true for forecasting when proceeding with the vector x backward
and forward in all directions and combining this with time either backward
or forward. Another problem area is non-stationary series. A regression
approach appears feasible for AR, MA, and ARMA models alike, whether
processes are stationary or non-stationary. However, the model must be
correctly specified for least squares approximations to be effective. More
work should be done on maximum likelihood, although approximate results
have been obtained (Aroian, 1985). Exact results are possible but the
expenditure of effort surely should be directed to other more fruitful aspects.
The applications field is wide open, and it is hoped that this area may go
forward. For present and past applications see Aroian (1985). What about
the work of others in this field? Haugh (1985) has analyzed the connections
between these results and those of Pfeifer and Deutsch (1980), and shows
that their model is a subset of time series in m dimensions. Bennett's models
(1979) are essentially the same as those of Aroian (1980, 1985) with a quite
different mathematical approach. D. S. Stoffer (1985) generalized the work
of Pfeifer and Deutsch (1980) using modified Kalman smoothed estimates
and mentioned also the work of Larimore (1977).

TIME SERIES VALUED EXPERIMENTAL DESIGNS:
ONE-WAY ANALYSIS OF VARIANCE
WITH AUTOCORRELATED ERRORS

ABSTRACT 

A methodology is developed for analysing factorial designs when the ob- 
servations at a particular treatment combination form a time series. Maxi- 
mum likelihood estimators of treatment effects and of time series parameters 
are found. Analogues of the standard F-ratios are proposed for testing treat- 
ment effects. A detailed discussion is given for the one-way classification with 
error variables generated by AR(1) processes. A simulation study for the 
case of two treatments is presented. 



1. INTRODUCTION 

Several models related to, but different from, that considered in the
sequel have been discussed in the literature. Azzalini (1981) considered the
model, $y_i(t) - \mu = \phi\{y_i(t-1) - \mu\} + a_i(t)$, where: $y_i(t)$ is the tth observation,
$t = 1, 2, \ldots, n_i$, of the ith time series, $i = 1, \ldots, k$; $\phi$ is the AR(1) parameter
for each time series; and $a_i(t)$ are the independent error variables having
identical normal distributions with zero mean and variance $\sigma_a^2$. Azzalini
dealt mainly with the estimation of the parameters $\mu$, $\phi$ and $\sigma_a^2$ of the
above model with special emphasis on the asymptotic results when $k \to \infty$
and $n_i$ is fixed, say $n_i = n$. Azzalini (1984) extended this model as follows:

y» (0 “ M* "1" Ct (0 > t = l,...,A;, t = l,...,fi, (1*1) 
with $Z_i(t) = \phi Z_i(t-1) + a_i(t)$. In equation (1.1): $\phi$ is as before; $\mu_i$ is the
random effect of the ith subject or the effect due to the ith time series such
that fjLi ^ where Xi is the p-dimensional vector of covariates 

and 13 is a p X 1 vector of unknown parameters; is the time effect due 
to non-stationarity of the series; and Zi{t) follows an AR(1) process. If 
p = 1, Azzalini’s (1984) scheme is a random effects two-way classification 
with autocorrelated errors. Andersen et al. (1981) studied a model similar 
to (1.1) but dealt mainly with a two-way fixed effects model with correlated 
errors. In Azzalini’s notation, Andersen et al. (1981) considered pi as the 
tth row effect and as the tth column or time effect. Box (1954b) also 
studied a model similar to that of Andersen et al. 

In each of the above articles, time has been considered as a specific factor.
More specifically, Andersen et al. (1981) applied their theoretical results to
clinical trial data. They analyzed data on the plasma citrate concentration
of k = 10 subjects measured at n = 14 equally spaced time points, to detect
whether the plasma citrate concentration changes during the day. In this
application one notes that time is a specific factor. Azzalini (1984) also
considered the same numerical example.

However, in many situations where an investigation is repeated over
time on physically independent material, and where external conditions can
be treated as random, it may be sensible to treat time as a non-specific
factor (see Cox, 1984). For example, the use of automated data acquisition
equipment may make it possible to obtain many observations on the
same treatment combination, but with only a small time interval between 
consecutive observations. The motivation for the formulation of the model 
considered below and for the concomitant analysis was a process control 
problem in which it was expensive to change the process parameters, but 
in which it was possible to make many observations in a short period of 
time for a fixed set of parameters. These observations formed a time series 
characterized by a high degree of correlation among contiguous observations. 
In other cases where data are collected over time it may not be clear that 
the white noise assumption is valid; one suspects for the most part these 
cases are analyzed routinely by ANOVA methods without challenging the 
independence assumption. 

At this point, it may be useful to comment on the use of the regression
approach to analysis of variance with ARMA(p, q) error structure. Berndt
and Savin (1975) have discussed Wald, likelihood ratio, Lagrange multiplier
and max-root tests for testing the linear hypothesis in the multivariate
linear regression model. However, they have shown that these tests based on
exact distributions conflict with each other when applied to a given data set.
In a later paper, Berndt and Savin (1977) showed that even in the
asymptotic case, the Wald, likelihood ratio and Lagrange multiplier tests yield
conflicting inferences. Rothenberg (1984) has suggested that, under many
regularity conditions on the behaviour of the error covariance matrix and
the coefficient matrix of the linear regression model, Edgeworth-corrected
critical values may be used for the above three tests. He claims that his
size-adjusted tests do not conflict in the case of a one-dimensional hypothesis; for
example, in testing $\alpha_1 = \alpha_2$, where for k = 2, $\alpha_1$ and $\alpha_2$ are the treatment
effects. This is because, in such cases, all three size-adjusted tests appear to
have the same approximate power function. However, these tests fail to give
a unique inference for multidimensional hypotheses; for example, in testing
$\alpha_1 = \alpha_2 = \alpha_3$ or $\alpha_1 = \alpha_2 = \alpha_3 = \alpha_4$, where the $\alpha$'s are the treatment effects.
Unlike the regression procedures discussed above, the methods given in the
sequel provide an exact analysis for testing hypotheses involving treatment
effects. The test in the present approach is computationally simpler than the
tests used in the regression approach, and the proposed method of analysis is
suitable for testing one-dimensional as well as multidimensional hypotheses.
Furthermore, in general, one is also concerned about the inference from the
size-adjusted test as it is based on approximations of different types at many
stages (see Rothenberg, 1984, for details). However, a detailed comparison
between the present approach and the regression approach discussed in the
econometric literature is beyond the scope of this paper.

The sequel develops the classical, as opposed to the regression, approach
to ANOVA for time series valued experimental designs in which time is a
non-specific factor. The analysis is carried out in the context of a one-way
classification with error variables forming an AR(1) time series, and with
autoregressive parameters of all series assumed equal. The methodology
may be generalized to more complex factorial designs with ARMA(p, q) error
structure, where the autoregressive and/or moving average parameters of all
series may or may not be equal. Distribution theory is discussed in Section
4 for the case where the ARMA parameters are known. In the final section,
tables of the 5% and 1% values of the proposed test statistic are provided
by a simulation study for AR(1) residuals and two treatments.

2. MODEL FOR TIME SERIES VALUED 
EXPERIMENTAL DESIGNS 

The fixed-effects model with time series valued error variables is
considered below; more precisely,

$$Y_i(t) = \mu + \alpha_i + Z_i(t), \tag{2.1}$$

where

$$\phi_i(B) Z_i(t) = \theta_i(B) a_i(t),$$

and where $Y_i(t)$ is the observation at time t due to the ith treatment, $\mu$ is
the overall mean effect, $\alpha_i$ is the effect of the ith treatment, and $Z_i(t)$ is
a component of an ARMA(p, q) process. The notation of Box and Jenkins
(1976) will be used; that is:

$$\phi_i(B) = (1 - \phi_{1i}B - \phi_{2i}B^2 - \cdots - \phi_{pi}B^p)$$

and

$$\theta_i(B) = (1 - \theta_{1i}B - \theta_{2i}B^2 - \cdots - \theta_{qi}B^q),$$

where B denotes the backshift operator and where $\phi(B)$ and $\theta(B)$ have their
zeros outside the unit circle. The $a_i(t)$ are independently and identically
distributed as $N(0, \sigma_a^2)$.

We consider in detail the AR(1) case, although, as indicated above, the
methodology may be extended to more complex ARMA models. Hence

$$Y_i(t) = \mu + \alpha_i + Z_i(t), \tag{2.2}$$

with

$$\phi_i(B) Z_i(t) = a_i(t),$$

where $\phi_i(B) = (1 - \phi_{1i}B)$, and $|\phi_{1i}| < 1$. In this model $\phi_{1i}$ denotes the
AR(1) parameter for observations derived from the ith treatment. Data
collected in accordance with model (2.2) may be graphed as in Figure 1.











Figure 1. Example of treatment time series traces (one trace for each of
the treatments Tr. 1, Tr. 2, Tr. i, and Tr. k along the Treatments axis).



Analogous to, and in generalization of, standard ANOVA definitions, we
define a particular weighted sum of treatment effects to be zero; that is,

$$\sum_{i=1}^{k} (1 - \phi_{1i})^2 \alpha_i = 0. \tag{2.3}$$

Although with this general definition one may estimate the parameters
of model (2.2), in the following discussion for the sake of simplicity, we shall
restrict ourselves to the case where $\phi_{1i} = \phi$ for all i. In many practical cases,
it may be reasonable to assume that the autoregressive parameters are equal
for all the series (see Azzalini, 1981, 1984; Andersen et al., 1981).

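From model (2.2) we note that

$$a_i(t) = Z_i(t) - \phi_{1i} Z_i(t-1) = \{Y_i(t) - \mu - \alpha_i\} - \phi_{1i}\{Y_i(t-1) - \mu - \alpha_i\}.$$

Parameters may then be estimated by minimizing the conditional sum of
squares, S, where

$$S = \sum_{i=1}^{k}\sum_{t=2}^{n} a_i^2(t) = \sum_{i=1}^{k}\sum_{t=2}^{n} \left[\{Y_i(t) - m_i\} - \phi\{Y_i(t-1) - m_i\}\right]^2,$$

with $m_i = \mu + \alpha_i$. One may now derive the normal equations and invoke
(2.3), to obtain the following equations for estimation of $m_1, \ldots, m_k$ and $\phi$:

$$\hat m_i = \frac{\sum_{t=2}^{n}\{Y_i(t) - \hat\phi\, Y_i(t-1)\}}{(n-1)(1-\hat\phi)}, \quad i = 1, \ldots, k,$$

and

$$\hat\phi = \frac{\sum_{i=1}^{k}\sum_{t=2}^{n}\{Y_i(t) - \hat m_i\}\{Y_i(t-1) - \hat m_i\}}{\sum_{i=1}^{k}\sum_{t=2}^{n}\{Y_i(t-1) - \hat m_i\}^2}.$$

The Newton-Raphson iterative formula for obtaining an explicit solution
is as follows:

$$\begin{pmatrix} \hat m \\ \hat\phi \end{pmatrix}^{(j+1)} = \begin{pmatrix} \hat m \\ \hat\phi \end{pmatrix}^{(j)} - J^{-1} \begin{pmatrix} F \\ G \end{pmatrix}, \tag{2.4}$$

where $\hat m = (\hat m_1, \ldots, \hat m_k)'$, $F = (F_1, \ldots, F_k)'$, J is the matrix of partial
derivatives of $(F', G)'$ with respect to $(m', \phi)'$, all evaluated at the jth iterate,

$$F_i = (n-1)\, m_i (1 - \phi) - \sum_{t=2}^{n}\{Y_i(t) - \phi\, Y_i(t-1)\},$$

$$G = \phi \sum_{i=1}^{k}\sum_{t=2}^{n}\{Y_i(t-1) - m_i\}^2 - \sum_{i=1}^{k}\sum_{t=2}^{n}\{(Y_i(t) - m_i)(Y_i(t-1) - m_i)\},$$

$$\frac{\partial F_i}{\partial m_i} = (n-1)(1-\phi), \qquad \frac{\partial F_i}{\partial \phi} = -m_i(n-1) + \sum_{t=2}^{n} Y_i(t-1),$$

$$G_i = \frac{\partial G}{\partial m_i} = 2(n-1)\, m_i(\phi - 1) + \sum_{t=2}^{n}\{Y_i(t) + (1 - 2\phi) Y_i(t-1)\},$$

$$G_3 = \frac{\partial G}{\partial \phi} = \sum_{i=1}^{k}\sum_{t=2}^{n}\{Y_i(t-1) - m_i\}^2.$$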
We remark here that, for the case $\phi_{1i} = \phi$, equation (2.3) reduces to
$\sum_{i=1}^{k} \alpha_i = 0$, which is the relationship among treatment effects assumed in
standard ANOVA; this condition is used in solving the normal equations.
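
As a rough numerical illustration of the two estimating equations, the sketch below solves them by simple alternation: given $\phi$ it updates each $m_i$, then given the $m_i$ it updates $\phi$, and repeats until $\phi$ stabilises. This alternating scheme is a stand-in chosen for brevity and is not the Newton-Raphson iteration (2.4); the function names, the simulated treatment levels (10, 12, 15), the value $\phi = 0.6$ and the series length n = 500 are all assumptions made for the example.

```python
import random


def fit_common_ar1(series, tol=1e-12, max_iter=500):
    """Conditional least squares for levels m_i and a common AR(1) phi.

    Alternates between the estimating equations for m_i and for phi.
    """
    n = len(series[0])
    phi = 0.0
    m = [sum(y) / n for y in series]
    for _ in range(max_iter):
        # m_i = sum_{t=2}^{n} {Y_i(t) - phi Y_i(t-1)} / {(n-1)(1-phi)}
        m = [sum(y[t] - phi * y[t - 1] for t in range(1, n))
             / ((n - 1) * (1.0 - phi)) for y in series]
        num = sum((y[t] - mi) * (y[t - 1] - mi)
                  for y, mi in zip(series, m) for t in range(1, n))
        den = sum((y[t - 1] - mi) ** 2
                  for y, mi in zip(series, m) for t in range(1, n))
        new_phi = num / den
        converged = abs(new_phi - phi) < tol
        phi = new_phi
        if converged:
            break
    return m, phi


def simulate_treatment(level, phi, n, rng, burn_in=100):
    """Simulate Y(t) = level + Z(t), Z(t) = phi Z(t-1) + a(t), a(t) ~ N(0, 1)."""
    z = 0.0
    out = []
    for t in range(n + burn_in):
        z = phi * z + rng.gauss(0.0, 1.0)
        if t >= burn_in:  # discard start-up values
            out.append(level + z)
    return out


rng = random.Random(42)
levels = [10.0, 12.0, 15.0]
data = [simulate_treatment(level, 0.6, 500, rng) for level in levels]
m_hat, phi_hat = fit_common_ar1(data)
```

The alternation works well here because each $m_i$ depends only weakly on $\phi$ when n is large; for short series, or when $\phi$ is near one, a Newton-type step such as (2.4) is the safer choice.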

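Since the error variables have been assumed to be independently and
identically distributed as $N(0, \sigma_a^2)$, $\sigma_a^2$ can be estimated by conditional
maximum likelihood as follows:

$$\hat\sigma_a^2 = \frac{1}{k(n-1)} \sum_{i=1}^{k}\sum_{t=2}^{n} \left[\{Y_i(t) - \hat m_i\} - \hat\phi\{Y_i(t-1) - \hat m_i\}\right]^2. \tag{2.5}$$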



3. TESTS FOR TREATMENT EFFECTS 

For the usual ANOVA model, viz., $Y_i(t) = \mu + \alpha_i + a_i(t)$, where the $a_i(t)$ are
i.i.d. $N(0, \sigma_a^2)$, one tests the null hypothesis, $H_0: \alpha_1 = \alpha_2 = \cdots = \alpha_k = 0$,
against the alternative $H_a: \alpha_i \ne 0$, for some i, by using the classical
F-statistic; namely,

$$F = \frac{n \sum_{i=1}^{k} (\bar Y_{i\cdot} - \bar Y_{\cdot\cdot})^2 / (k-1)}{\sum_{i=1}^{k}\sum_{t=1}^{n} \{Y_i(t) - \bar Y_{i\cdot}\}^2 / \{k(n-1)\}}, \tag{3.1}$$

where $\bar Y_{i\cdot} = \sum_{t=1}^{n} Y_i(t)/n$ and $\bar Y_{\cdot\cdot} = \sum_{i=1}^{k}\sum_{t=1}^{n} Y_i(t)/(kn)$. For model
(2.2) this statistic is inappropriate, since the dependence among observations
implied by the model alters the amount of information provided by the
observations. In the case of high positive correlation, as would be the case
for $\phi$ near unity, a new observation will provide very little new information.
This radically alters the expected values of the components of the F-ratio.
Also, in contrast to standard ANOVA, the treatment sum of squares and
the error sum of squares are not independent, except for the case $\phi_{1i} = 0$.
For these reasons the usual critical points obtained from the F-tables are
invalid. The following analysis characterizes the degree of invalidity and
indicates that in cases of large correlation between contiguous observations,
the standard F-test is in substantial error.

The purpose of this paper is to provide analogous statistics which fit into
the standard ANOVA paradigm, but which take into account the effects of
correlation on the expected values of the sums of squares, particularly when
the errors form an AR(1) process. In the sequel, we assume the
autoregressive parameter is the same value for all series, and adopt the usual ANOVA
notation.

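Let $Y_i(t)$ be the observation at time t which is due to the ith treatment
and which is generated as follows:

$$Y_i(t) = \mu + \alpha_i + Z_i(t), \tag{3.2}$$

with

$$Z_i(t) = \phi Z_i(t-1) + a_i(t),$$

where $a_i(t)$ are i.i.d. $N(0, \sigma_a^2)$. Then the following statistic is proposed for
testing $H_0$ versus $H_a$:

$$F^* = \frac{kn(n-1) \sum_{i=1}^{k} (\bar Y_{i\cdot} - \bar Y_{\cdot\cdot})^2 \, c_2(\hat\phi)}{(k-1) \sum_{i=1}^{k}\sum_{t=1}^{n} \{Y_i(t) - \bar Y_{i\cdot}\}^2 \, c_1(\hat\phi)}, \tag{3.3}$$

where

$$c_1(\phi) = \left(\frac{1}{1-\phi}\right)^2 \left\{1 - \frac{2\phi(1 - \phi^n)}{n(1 - \phi^2)}\right\}$$

and

$$c_2(\phi) = \frac{1}{n-1}\left\{\frac{n}{1-\phi^2} - c_1(\phi)\right\}.$$

$H_0$ is rejected for large values of $F^*$.

We now examine the effects of autocorrelation on the sums of squares.
Since $\sum_{i=1}^{k} \alpha_i = 0$ and

$$\bar Y_{i\cdot} = \mu + \alpha_i + \sum_{t=1}^{n} Z_i(t)/n,$$

then

$$\sum_{i=1}^{k} \bar Y_{i\cdot} = k\mu + \sum_{i=1}^{k}\sum_{t=1}^{n} Z_i(t)/n.$$

Because the $a_i(t)$'s are independently and identically distributed as $N(0, \sigma_a^2)$,
it follows that

$$E\left\{\frac{1}{k}\Big(\sum_{i=1}^{k} \bar Y_{i\cdot}\Big)^2\right\} = k\mu^2 + \sigma_a^2 c_1(\phi)/n, \tag{3.4}$$

where $c_1(\phi)$ is given by (3.3). Also, similar calculations show that

$$E\left\{\sum_{i=1}^{k} \bar Y_{i\cdot}^2\right\} = k\mu^2 + \sum_{i=1}^{k} \alpha_i^2 + k\sigma_a^2 c_1(\phi)/n. \tag{3.5}$$

Hence, from (3.4) and (3.5), one obtains for the treatment sum of squares,
TrSS,

$$E(TrSS) = E\left\{n\sum_{i=1}^{k} (\bar Y_{i\cdot} - \bar Y_{\cdot\cdot})^2\right\} = n\sum_{i=1}^{k} \alpha_i^2 + (k-1)\sigma_a^2 c_1(\phi).$$

Therefore, under $H_0$,

$$E\{TrSS/\sigma_a^2\} = (k-1)\, c_1(\phi). \tag{3.6}$$

The expression $(k-1)c_1(\phi)$ will be referred to as the “degrees of freedom”
(d.f.) for TrSS.

Now consider the error sum of squares, RSS. Equation (3.2) implies
that

$$E\left[\sum_{i=1}^{k}\sum_{t=1}^{n} \{Y_i(t)\}^2\right] = kn\mu^2 + n\sum_{i=1}^{k} \alpha_i^2 + kn\sigma_a^2/(1 - \phi^2). \tag{3.7}$$

Hence (3.5) and (3.7) yield

$$E(RSS) = E\left[\sum_{i=1}^{k}\sum_{t=1}^{n} \{Y_i(t) - \bar Y_{i\cdot}\}^2\right] = k(n-1)\sigma_a^2 c_2(\phi), \tag{3.8}$$

where $c_2(\phi)$ is given by (3.3). Hence, under both $H_0$ and $H_a$, $E\{RSS/\sigma_a^2\} =
k(n-1)c_2(\phi)$. The expression $k(n-1)c_2(\phi)$ will be referred to as the “degrees
of freedom” for the error sum of squares.

Thus,

$$E\left\{RSS/[k(n-1)c_2(\phi)]\right\} = \sigma_a^2$$

and

$$E\left\{TrSS/[(k-1)c_1(\phi)]\right\} = \sigma_a^2 + \frac{n\sum_{i=1}^{k} \alpha_i^2}{(k-1)\, c_1(\phi)}.$$

These expected “mean squares” suggest that $H_0$ be rejected when $F^*$ is
“large”, where “large” is determined from distributional results.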

It follows from (3.3) that the effect produced by autocorrelated errors is
embodied in the ratio $c_2(\phi)/c_1(\phi)$ which may be estimated as follows:

$$\frac{c_2(\phi)}{c_1(\phi)} = \frac{n(1-\phi)\{1 - \phi(n+1)/(n-1)\} + 2\phi(1-\phi^n)/(n-1)}{n(1-\phi^2) - 2\phi(1-\phi^n)}.$$

For $\phi = -0.9(0.1)0.9$, $c_1(\phi)$, $c_2(\phi)$ and $c_2(\phi)/c_1(\phi)$ are graphed in Figures
2 and 3 for $n = 25$ and $75$ respectively. The case of independent observations
is represented by $\phi = 0$. The functions $c_1(\phi)$ and $c_2(\phi)$ show the
effects of autocorrelated errors on the “degrees of freedom” for the treatment
sum of squares and for the error sum of squares respectively. The graph of
$c_2(\phi)/c_1(\phi)$ shows how the usual F statistic (for the independent case) is
affected by autocorrelated errors. Although $c_2(\phi)/c_1(\phi)$ depends upon sample
size, for a wide range of $n$, MacNeill et al. (1985) indicated that the ratio
may be approximated by $(1-\phi)/(1+\phi)$ for $|\phi| \le 0.9$.
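
The behaviour of these functions can be checked numerically. The following is a minimal sketch, assuming the closed forms $c_1(\phi) = (1-\phi)^{-2}\{1 - 2\phi(1-\phi^n)/(n(1-\phi^2))\}$ and $c_2(\phi) = \{n/(1-\phi^2) - c_1(\phi)\}/(n-1)$, which are the forms implied by (3.8) and by the ratio above. The function names are illustrative only, and `c1_brute` is an independent cross-check built from the AR(1) autocovariances.

```python
def c1(phi, n):
    """n * Var(treatment mean) / sigma_a^2 for AR(1) errors (assumed closed form)."""
    return (1.0 - 2.0 * phi * (1.0 - phi ** n) / (n * (1.0 - phi ** 2))) / (1.0 - phi) ** 2


def c2(phi, n):
    """Multiplier of the error 'degrees of freedom' k(n-1)."""
    return (n / (1.0 - phi ** 2) - c1(phi, n)) / (n - 1.0)


def ratio(phi, n):
    """c2/c1, the factor that rescales the usual F statistic."""
    return c2(phi, n) / c1(phi, n)


def c1_brute(phi, n):
    """Same quantity as c1, summed directly from autocovariances phi^|t-s| / (1 - phi^2)."""
    total = sum(phi ** abs(t - s) for t in range(n) for s in range(n))
    return total / (n * (1.0 - phi ** 2))


if __name__ == "__main__":
    for n in (25, 75):
        for phi in (-0.9, -0.5, 0.0, 0.5, 0.9):
            # exact ratio next to the large-sample approximation (1 - phi)/(1 + phi)
            print(n, phi, round(ratio(phi, n), 4), round((1 - phi) / (1 + phi), 4))
```

For $\phi = 0$ both functions equal one, so the usual F statistic is recovered.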



4. DISTRIBUTION OF THE TEST STATISTIC 



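The test statistic $F^*$ may be written in the form

$$F^* = d \sum_{i=1}^{k} (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2 \Big/ \sum_{i=1}^{k} \sum_{t=1}^{n} (Y_i(t) - \bar{Y}_{i\cdot})^2, \qquad (4.1)$$

where

$$d = \{kn(n-1)/(k-1)\}\{c_2(\phi)/c_1(\phi)\}.$$

We now examine the distribution of $F^*$ where $d$ is assumed known. Hence,
we consider

$$\Pr(F^* < f) = \Pr\left[ \sum_{i=1}^{k} (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2 - (f/d) \sum_{i=1}^{k} \sum_{t=1}^{n} (Y_i(t) - \bar{Y}_{i\cdot})^2 < 0 \right] = \Pr[\{Q_1 - (f/d) Q_2\} < 0], \qquad (4.2)$$

where

$$Q_1 = \sum_{i=1}^{k} (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2$$

and

$$Q_2 = \sum_{i=1}^{k} \sum_{t=1}^{n} (Y_i(t) - \bar{Y}_{i\cdot})^2.$$
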

Thus the problem is one of determining the distribution of the difference of 
two quadratic forms in the same variables. 

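We consider equivalent formulations for $Q_1$ and $Q_2$ in terms of these
same variables. The quadratic form $Q_1$ may be written as follows:

$$Q_1 = \bar{Y}_{\cdot}' A \bar{Y}_{\cdot}, \qquad (4.3)$$

where

$$\bar{Y}_{\cdot}' = (\bar{Y}_{1\cdot}, \bar{Y}_{2\cdot}, \ldots, \bar{Y}_{k\cdot})$$

and $A$ is the $k \times k$ matrix

$$A = \begin{pmatrix} (1-1/k) & -1/k & \cdots & -1/k \\ -1/k & (1-1/k) & \cdots & -1/k \\ \vdots & \vdots & & \vdots \\ -1/k & -1/k & \cdots & (1-1/k) \end{pmatrix}.$$

We rewrite $Q_1$ as follows:

$$Q_1 = Y^{*\prime} (I_k \otimes 1_n) A (I_k \otimes 1_n') Y^* / n^2,$$

where

$$Y^{*\prime} = (Y_{11}, \ldots, Y_{1n}, Y_{21}, \ldots, Y_{2n}, \ldots, Y_{k1}, \ldots, Y_{kn}),$$

$I_k$ is an identity matrix of order $k$, and $\otimes$ denotes the Kronecker product.
$Q_1$ can be written in more compact form as follows:

$$Q_1 = Y^{*\prime} \{(I_k - k^{-1} U_k) \otimes U_n\} Y^* / n^2, \qquad (4.4)$$

where $U = 11'$.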

We next consider the quadratic form in the denominator, $Q_2$. First we
observe that

$$Q_2 = \sum_{i=1}^{k} Y_i' B Y_i,$$

where $Y_i' = (Y_{i1}, \ldots, Y_{in})$ and $B$ is the $n \times n$ matrix given by

$$B = \begin{pmatrix} (1-1/n) & -1/n & \cdots & -1/n \\ -1/n & (1-1/n) & \cdots & -1/n \\ \vdots & \vdots & & \vdots \\ -1/n & -1/n & \cdots & (1-1/n) \end{pmatrix}.$$

$Q_2$ can be written in compact form as follows:

$$Q_2 = Y^{*\prime} \{I_k \otimes (I_n - n^{-1} U_n)\} Y^*, \qquad (4.5)$$

where $Y^*$ and $U_n$ are defined as in (4.4).
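
The equivalence between the sums of squares and the quadratic forms (4.3) and (4.5) can be verified on arbitrary data. The following is a minimal sketch in plain Python; the helper names (`centering`, `quad_form`, and so on) are illustrative and do not come from the paper. It uses the fact that both $A$ and $B$ are centering matrices of the form $I_m - m^{-1}U_m$.

```python
import random


def centering(m):
    """The m x m matrix I_m - U_m / m, where U_m is the matrix of ones."""
    return [[(1.0 if i == j else 0.0) - 1.0 / m for j in range(m)] for i in range(m)]


def quad_form(x, a):
    """Compute x' a x for a vector x and a square matrix a."""
    return sum(x[i] * a[i][j] * x[j] for i in range(len(x)) for j in range(len(x)))


def q1_direct(y):
    """Sum over treatments of (treatment mean - grand mean)^2."""
    means = [sum(row) / len(row) for row in y]
    grand = sum(means) / len(means)
    return sum((m - grand) ** 2 for m in means)


def q2_direct(y):
    """Sum over treatments and times of (observation - treatment mean)^2."""
    return sum((v - sum(row) / len(row)) ** 2 for row in y for v in row)


def q1_matrix(y):
    """Q1 as the quadratic form (4.3) in the vector of treatment means."""
    means = [sum(row) / len(row) for row in y]
    return quad_form(means, centering(len(means)))


def q2_matrix(y):
    """Q2 as the sum over treatments of Y_i' B Y_i."""
    b = centering(len(y[0]))
    return sum(quad_form(row, b) for row in y)


if __name__ == "__main__":
    rng = random.Random(0)
    y = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(3)]  # k = 3, n = 5
    print(q1_direct(y), q1_matrix(y))
    print(q2_direct(y), q2_matrix(y))
```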

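Now, we rewrite (4.2) using the results embodied in expressions (4.4)
and (4.5):

$$\begin{aligned} \Pr[\{Q_1 - (f/d) Q_2\} < 0] &= \Pr[\{Y^{*\prime} ((I_k - k^{-1} U_k) \otimes U_n) Y^* / n^2 \\ &\qquad - Y^{*\prime} (I_k \otimes (I_n - n^{-1} U_n)) (f/d) Y^*\} < 0] \\ &= \Pr[Y^{*\prime} R Y^* > 0], \end{aligned} \qquad (4.6)$$

where

$$R = M_f - P,$$

with

$$M_f = (f/d)\{I_k \otimes (I_n - n^{-1} U_n)\}$$

and

$$P = \{(I_k - k^{-1} U_k) \otimes U_n\}/n^2,$$

where $d$ is given in (4.1). It may be noted that the vector $Y^*$ in (4.6) has
the following distribution:

$$Y^* = (Y_{11}, \ldots, Y_{1n}, \ldots, Y_{k1}, \ldots, Y_{kn})' \sim N_{kn}(m, I_k \otimes \Lambda),$$

where

$$m = ((m_1 \otimes 1_n)', \ldots, (m_k \otimes 1_n)')'$$

with $m_i = \mu + \alpha_i$ for $i = 1, \ldots, k$, and

$$\Lambda = \frac{\sigma_a^2}{1-\phi^2} \begin{pmatrix} 1 & \phi & \cdots & \phi^{n-1} \\ \phi & 1 & \cdots & \phi^{n-2} \\ \vdots & \vdots & & \vdots \\ \phi^{n-1} & \phi^{n-2} & \cdots & 1 \end{pmatrix}.$$

Now, consider the transformation $(I_k \otimes \Lambda^{-1/2}) Y^* = Z^*$, where $Z^{*\prime} = (Z_{11}, \ldots, Z_{kn})$.
Then the distribution function given by (4.6) reduces to

$$\Pr(Z^{*\prime} R_\sigma Z^* > 0), \qquad (4.7)$$

where

$$Z^* \sim N\left( \begin{pmatrix} \Lambda^{-1/2}(m_1 \otimes 1_n) \\ \vdots \\ \Lambda^{-1/2}(m_k \otimes 1_n) \end{pmatrix}, I_{kn} \right)$$

and $R_\sigma$ is the $kn \times kn$ matrix given by

$$R_\sigma = M_{f\sigma} - P_\sigma,$$

with

$$P_\sigma = \{(I_k - k^{-1} U_k) \otimes \Lambda^{1/2} U_n \Lambda^{1/2}\}/n^2$$

and

$$M_{f\sigma} = \{I_k \otimes \Lambda^{1/2} (I_n - n^{-1} U_n) \Lambda^{1/2}\}(f/d).$$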

For a given $f$, our objective is to derive the distribution of $Z^{*\prime} R_\sigma Z^*$ both
under $H_0$ and $H_a$. Since $M_{f\sigma}' = M_{f\sigma}$ and $P_\sigma' = P_\sigma$, it may be observed
that $R_\sigma$ is a symmetric matrix. Suppose the rank of $R_\sigma$ is $m \le nk$. Then
there exists an orthogonal matrix $\Gamma$ such that

$$R_\sigma = \Gamma \Lambda_\lambda \Gamma', \qquad (4.8)$$

where

$$\Lambda_\lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m), \qquad m \le nk.$$

Therefore, the probability in (4.7) is given by:

$$\Pr(Z^{*\prime} R_\sigma Z^* > 0) = \Pr(U^{*\prime} \Gamma' R_\sigma \Gamma U^* > 0) = \Pr(U^{*\prime} \Lambda_\lambda U^* > 0), \qquad (4.9)$$

where $U^* \sim N(\delta, I_{nk})$, and where

$$\delta = (\delta_1, \ldots, \delta_{nk})' = \Gamma' \xi,$$

with

$$\xi = \begin{pmatrix} \Lambda^{-1/2}(m_1 \otimes 1_n) \\ \vdots \\ \Lambda^{-1/2}(m_k \otimes 1_n) \end{pmatrix}.$$

Define $Q_f = U^{*\prime} \Lambda_\lambda U^*$. Then the probability given by (4.9) can be written
as follows:

$$\Pr\left\{ Q_f = \sum_{j=1}^{m} \lambda_j U_j^2 > 0 \right\}, \qquad m \le nk, \qquad (4.10)$$

where $\lambda_j$, $j = 1, \ldots, m$, are the eigenvalues of the matrix $R_\sigma$ defined by
(4.7). Note that no $\lambda_j$ in (4.10) is zero; however, some are positive and the
remainder negative. Moreover, some may be multiple roots. Also note in
(4.10) that the $U_j$'s are independent and $U_j^2$ has a non-central chi-square
distribution with 1 degree of freedom and non-centrality parameter $\delta_j^2$, where $\delta_j$
is given in (4.9). Thus, $Q_f$ in (4.10) can be written as a difference of two
linear combinations of independent non-central chi-square variables, where
each linear combination contains only positive coefficients. For the case when
$U$ and $V$ are two linear combinations in independent non-central chi-square
variables, the theoretical distribution of $T = (U - V)$ has been studied by
Press (1966), among others. In view of Press’s result, the distribution of $Q_f$
is known and $\Pr\{Q_f > 0\}$ in (4.10) can, in theory, be calculated. However,
it should be recalled that $\phi$ has been assumed known, and it should be noted
that the theory is awkward to apply for purposes of calculation.
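
Where the exact theory is awkward, a probability of the form (4.10) can at least be approximated by simulation. The following is a minimal Monte Carlo sketch, not Press's (1966) method: it assumes the eigenvalues $\lambda_j$ and the means $\delta_j$ have already been obtained by some other means, and the values in the usage lines are hypothetical.

```python
import random


def prob_qf_positive(lams, deltas, reps=20000, seed=1):
    """Estimate Pr{ sum_j lam_j * U_j^2 > 0 } with independent U_j ~ N(delta_j, 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        q = sum(lam * rng.gauss(d, 1.0) ** 2 for lam, d in zip(lams, deltas))
        if q > 0.0:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    # Hypothetical eigenvalues (some positive, some negative) and central case delta = 0.
    print(prob_qf_positive([0.8, 0.8, -0.1, -0.1], [0.0, 0.0, 0.0, 0.0]))
    # Same eigenvalues with non-zero means, as would arise under the alternative.
    print(prob_qf_positive([0.8, 0.8, -0.1, -0.1], [0.0, 0.0, 1.5, 1.5]))
```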



5. SIMULATION STUDY FOR THE CASE OF 
TWO TREATMENTS WITH AR(1) ERROR VARIABLES 

Consider the model given by (2.2). For the case when $\phi_i = \phi$, $i = 1, \ldots, k$,
it has been suggested in Section 3 that for testing $H_0: \alpha_1 = \alpha_2 =
\cdots = \alpha_k = 0$ versus $H_a: \alpha_i \ne 0$ for some $i$, an appropriate test statistic is

$$F^* = \frac{kn(n-1)\, c_2(\phi) \sum_{i=1}^{k} (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2}{(k-1)\, c_1(\phi) \sum_{i=1}^{k} \sum_{t=1}^{n} (Y_i(t) - \bar{Y}_{i\cdot})^2}.$$

The theoretical probability distribution of this test statistic when $\phi$ is known
has been given in Section 4. For significance testing one needs to know the
values of $f$ such that $P(F^* < f) = 0.95$ or $P(F^* < f) = 0.99$. In the special
case, $\phi = 0$ and not estimated, one may use the usual F tables to determine
$f$. But in the present case, calculations of exact probabilities through (4.6)
to (4.9) of Section 4, and hence the determination of $f$ for $\phi = -0.9(0.1)0.9$,
is cumbersome. Furthermore, $\phi$ is generally not known.

Therefore, a simulation study was conducted for the two treatment case
($k = 2$), and tables were constructed for the 5% and 1% values of $F^*$ for
$\phi = -0.9(0.1)0.9$ for $n = 75$ and $n = 100$. For each set of parameters,
1000 simulations were carried out. The series were simulated, and $\alpha_1$, $\alpha_2$,
$\phi$ and $\sigma_a^2$ were estimated by the Newton-Raphson iteration procedure as
described in Section 2. The $F^*$ statistic was calculated for each simulation
and the 5% and 1% points were estimated. Generally speaking, the estimated
percentage points were found to increase in magnitude as $|\phi|$ increased.
Hence, percentage values were smoothed over the range of values of the
autoregressive parameter. The results are shown in Tables 1 and 2.
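
A simplified version of such a simulation can be sketched as follows. This is an illustration under stated assumptions, not the study's procedure: $\phi$ is treated as known rather than estimated by Newton-Raphson, no smoothing is applied, the closed forms for $c_1$ and $c_2$ are those implied by (3.8), and all function names are hypothetical. Because $\phi$ is not estimated here, the resulting points need not match the tabulated values.

```python
import random


def c1(phi, n):
    return (1.0 - 2.0 * phi * (1.0 - phi ** n) / (n * (1.0 - phi ** 2))) / (1.0 - phi) ** 2


def c2(phi, n):
    return (n / (1.0 - phi ** 2) - c1(phi, n)) / (n - 1.0)


def ar1_series(n, phi, rng, mean=0.0, sigma=1.0):
    """One treatment series: mean plus stationary AR(1) errors with innovation s.d. sigma."""
    z = rng.gauss(0.0, sigma / (1.0 - phi * phi) ** 0.5)  # draw from the stationary law
    series = []
    for _ in range(n):
        series.append(mean + z)
        z = phi * z + rng.gauss(0.0, sigma)
    return series


def f_star(groups, phi):
    """The statistic F* of (4.1) for k series of equal length n, with phi known."""
    k, n = len(groups), len(groups[0])
    means = [sum(g) / n for g in groups]
    grand = sum(means) / k
    q1 = sum((m - grand) ** 2 for m in means)
    q2 = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    d = k * n * (n - 1.0) / (k - 1.0) * c2(phi, n) / c1(phi, n)
    return d * q1 / q2


def upper_point(phi, n, k=2, level=0.95, reps=1000, seed=0):
    """Empirical upper percentage point of F* under H0 (all treatment effects zero)."""
    rng = random.Random(seed)
    stats = sorted(
        f_star([ar1_series(n, phi, rng) for _ in range(k)], phi) for _ in range(reps)
    )
    return stats[min(reps - 1, int(level * reps))]


if __name__ == "__main__":
    for phi in (-0.5, 0.0, 0.5):
        print(phi, round(upper_point(phi, 75), 2), round(upper_point(phi, 75, level=0.99), 2))
```

With $\phi = 0$ the statistic reduces to the usual F ratio, so the simulated 5% point should lie near the tabled F value for 1 and $k(n-1)$ degrees of freedom.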



Table 1. Simulation estimates of 5% and 1% points of the distribution
of the F* statistic for two treatments and AR(1) errors: sample size of 75.

    φ        φ̂       5% point   1% point

  -0.9    -0.890      4.53       7.82
  -0.8    -0.795      4.32       7.40
  -0.7    -0.691      4.15       7.06
  -0.6    -0.599      4.02       6.82
  -0.5    -0.506      3.94       6.66
  -0.4    -0.405      3.89       6.59
  -0.3    -0.305      3.88       6.61
  -0.2    -0.209      3.91       6.72
  -0.1    -0.111      3.98       6.91
   0.0    -0.011      4.09       7.19
   0.1     0.084      4.23       7.56
   0.2     0.179      4.42       8.02
   0.3     0.279      4.65       8.56
   0.4     0.377      4.92       9.20
   0.5     0.472      5.23       9.92
   0.6     0.571      5.57      10.72
   0.7     0.667      5.96      11.62
   0.8     0.763      6.38      12.60
   0.9     0.855      6.85      13.67



It should be noted that the effect of large correlation is to reduce the
number of “degrees of freedom”. It can also be noted that the requirement
to estimate φ also increases the critical point.

Table 2. Simulation estimates of 5% and 1% points of the distribution of
the F* statistic for two treatments and AR(1) errors: sample size of 100.

    φ        φ̂       5% point   1% point

  -0.9    -0.893      4.13       8.24
  -0.8    -0.791      4.00       7.73
  -0.7    -0.698      3.89       7.31
  -0.6    -0.599      3.81       6.97
  -0.5    -0.501      3.76       6.73
  -0.4    -0.402      3.74       6.57
  -0.3    -0.306      3.74       6.50
  -0.2    -0.204      3.77       6.52
  -0.1    -0.106      3.83       6.64
   0.0    -0.011      3.92       6.83
   0.1     0.088      4.04       7.12
   0.2     0.188      4.19       7.50
   0.3     0.287      4.36       7.97
   0.4     0.385      4.56       8.52
   0.5     0.481      4.79       9.16
   0.6     0.576      5.05       9.90
   0.7     0.677      5.34      10.72
   0.8     0.770      5.65      11.63
   0.9     0.866      6.00      12.63


CONCLUSIONS 

Highly automated data acquisition systems make possible the rapid gath- 
ering of large numbers of observations in industrial processes. These obser- 
vations may be densely packed over time with the consequence that they 
may be highly positively correlated. Tables 1 and 2 and Figures 2 and 3 
indicate the dramatic effect of the presence of such autocorrelations. Hence 
the straightforward use of the standard F-test may be highly misleading. 
The modifications proposed to the F-test account for autocorrelation in the 
replications.

MONTHLY VERSUS ANNUAL REVISIONS OF
CONCURRENT SEASONALLY ADJUSTED SERIES

1. INTRODUCTION 

Statistics Canada’s official policy of using concurrent seasonal adjust- 
ment was established in 1975; gradually, other foreign statistical agencies 
followed it. The old practice for seasonally adjusting a current (monthly 
or quarterly) observation was to apply year-ahead seasonal factors generated
from a series that ended in the month of December of the previous
year. Since these projected factors were calculated ahead of the actual time 
they were applied, they didn’t take into account the most recent informa- 
tion incorporated into the series. On the other hand, the use of a concurrent 
seasonal factor to produce a current seasonally adjusted datum implies the 
use of all the data in the series up to and including the current month’s 
observation. 

The main reason for using concurrent instead of seasonal factor forecasts
is that the former are subject to smaller revisions as new observations are
added to the series. This important result has been confirmed in several
empirical studies; see among others, Dagum (1978), Bayer and Wilcox (1981),
Kenny and Durbin (1982), McKenzie (1982, 1984) and Dagum and Morry 
(1984). 

There are two sources influencing the size of the revisions of current 
seasonally adjusted data: (1) differences in the moving averages or linear 
filters applied to the same observation as later data become available; and 
(2) the innovations that enter the series with new observations. Ideally one 
would like to minimize revisions due to filter discrepancies. Two studies by 
Dagum (1982a,b) have shown that if the current observation is seasonally 
adjusted using a concurrent seasonal factor instead of a year-ahead factor, 
the corresponding concurrent linear filter is subject to smaller revisions than 
any of the year-ahead seasonal filters. The same conclusions have been
reached in a recent study by Pierce and McKenzie (1985) from a time series
analysis viewpoint.

^ Director, Time Series Research and Analysis Division, Statistics Canada,
Ottawa, Ontario K1A 0T6

The use of concurrent seasonal factors for current seasonal adjustment
poses the problem of how often the series should be revised. In this regard,
Kenny and Durbin (1982) recommended that revisions should be made after
one month and thereafter each calendar year. Dagum (1982b) supported
these conclusions and, furthermore, recommended an additional revision at
six months if the seasonal adjustment method is the X-11-ARIMA without
ARIMA extrapolation. In this case, the X-11-ARIMA (Dagum, 1980) closely
approximates the Census Method II-X-11 version (Shiskin et al., 1967)
except for changes in the treatment of outliers and the use of more accurate
end weights for the seasonal moving averages.

Recently, Burridge and Wallis (1984) showed that the X-11 filters are 
not internally consistent in a signal extraction sense. In fact, they observed 
that the transfer function of the first year revised concurrent filter differed 
more from that of the symmetric filter (to which it should converge) than 
the transfer function of the concurrent filter itself. It would appear as if 
the transition from asymmetric to symmetric filters was not gradual for all 
frequencies. 

This paper deals with the problem of consistency between successive
filters in relation to the revision pattern of the concurrent linear filters of
X-11-ARIMA and X-11. Section 2 introduces three measures of filter revisions
given by the root mean square differences between the frequency response
functions of the analysed filters. Section 3 estimates and discusses the time
paths of the revision of the concurrent seasonal adjustment filters of X-11 and
X-11-ARIMA for consecutive month-spans. Section 4 estimates and analyses
the time paths of the monthly and annual revisions of the concurrent and 
remaining asymmetric filters, and Section 5 gives the main conclusions of 
this study. 



2. MEASURES OF FILTER REVISIONS 

Under the assumption of an additive decomposition model and no
replacement of extreme values, the seasonally adjusted estimates from
X-11-ARIMA with and without ARIMA extrapolations are obtained by the
application of a set of moving averages or linear filters. For central or middle
observations, say $n + 1 \le t \le T - n$, the filter is always the same and
symmetric whereas for the remaining $n$ observations on both ends of the series,
the filters are asymmetric and different for each observation.

We can express the seasonally adjusted value, for recent years, from
X-11-ARIMA and X-11 by

$$y_t^{(m)} = \sum_{j=-m}^{n} h_{m,j}\, x_{t-j} = h^{(m)}(B)\, x_t, \qquad (2.1)$$

where $y_t^{(m)}$ is the seasonally adjusted estimate from a series $x_{t-n},
\ldots, x_t, \ldots, x_{t+m}$; $h_{m,j}$ denotes the moving average weights to be applied
to the series and $h^{(m)}(B)$ denotes the corresponding linear filter using the
backshift operator $B$, such that $B^n x_t = x_{t-n}$.

For $m = 0$, $y_t^{(0)}$ is the concurrent seasonally adjusted value and $h^{(0)}(B)$
the corresponding concurrent filter; for $m = 1$, $y_t^{(1)}$ is the first-period
(month, quarter) revised seasonally adjusted figure and for $m = n$, $y_t^{(n)}$
is the final seasonally adjusted value in the sense that it is estimated with a
symmetric filter where $h_{n,j} = h_{n,-j}$ for all $j$. For any two points
in time $t + k$, $t + \ell$ ($k < \ell$), the revision of the seasonally adjusted value is
given by

$$r_t^{(\ell,k)} = y_t^{(\ell)} - y_t^{(k)}, \qquad k < \ell. \qquad (2.2)$$

This revision reflects: (1) the innovations introduced by the new observations
$x_{t+k+1}, x_{t+k+2}, \ldots, x_{t+\ell}$; and (2) the differences between the two asymmetric
filters $h^{(\ell)}(B)$ and $h^{(k)}(B)$. If one fixes $k = 0$ and lets $\ell$ vary from 1 to $n$,
then (2.2) gives a sequence of revisions of the concurrent seasonally adjusted
value for different time spans or lags. The total revision of the concurrent
estimate is obtained for $\ell = n$. If one fixes $\ell = k + 1$ and lets $k$ take values
from 0 to $n - 1$, then equation (2.2) gives the sequence of single period
revisions of each estimated seasonally adjusted value, and in particular, if one
starts at $k = 0$ one obtains the $n - 1$ successive single-period revisions of
each estimated seasonally adjusted value before it becomes final. If one fixes
$\ell = k + 12$ and lets $k$ take values from 0 to $n - 12$ then equation (2.2) gives
the sequence of annual revisions. The revisions in which we are interested
here are those introduced by filter discrepancies, and these will be studied
by looking at the frequency response functions of the corresponding filters.

Equation (2.1) represents a linear system where $y_t^{(m)}$ is the convolution
of the input $x_t$ and a sequence of weights $h_{m,j}$ called the impulse response
function of the filter. The properties of this function can be explored using
its Fourier transform which is called the frequency response function, defined
by

$$H^{(m)}(\omega) = \sum_{j=-m}^{n} h_{m,j}\, e^{-i 2\pi \omega j}, \qquad 0 \le \omega \le 1/2, \qquad (2.3)$$

where $\omega$ is the frequency in cycles per unit time. Theoretically the frequency
response function of a (discrete) filter exists over the interval $[-1/2, 1/2]$
but, in practice, it is sufficient to describe it for $\omega \in [0, 1/2]$ so that it
is completely described in the frequency domain. $H(\omega)$ fully describes the
effects of the linear filter on the given input. In general, the frequency
response function may be expressed in polar form as follows:

$$H(\omega) = A(\omega) + iB(\omega) = G(\omega)\, e^{i\phi(\omega)}, \qquad (2.4)$$

where $G(\omega) = [A^2(\omega) + B^2(\omega)]^{1/2}$ is called the gain of the filter, and $\phi(\omega) =
\arctan[B(\omega)/A(\omega)]$ is called the phase angle of the filter and is expressed
in radians. The gain and the phase angle vary with the frequency $\omega$. For
symmetric filters, the phase angle is zero or $\pm\pi$ and for asymmetric filters
it can take any value between $\pm\pi$; it is undefined at those frequencies where
the gain is zero.
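
The gain and phase of a small filter are easy to compute directly. The following is a minimal sketch, assuming the sign convention $e^{-i2\pi\omega j}$ for a weight applied to $x_{t-j}$. The two toy filters are hypothetical stand-ins, not X-11 weights, and `math.atan2` is used as a quadrant-aware form of the arctangent.

```python
import cmath
import math


def frequency_response(weights, omega):
    """H(omega) for weights given as {lag j: h_j}, where h_j multiplies x_{t-j}."""
    return sum(h * cmath.exp(-2j * math.pi * omega * j) for j, h in weights.items())


def gain(weights, omega):
    """G(omega) = |H(omega)|."""
    return abs(frequency_response(weights, omega))


def phase(weights, omega):
    """Phase angle in radians; meaningful only where the gain is non-zero."""
    h = frequency_response(weights, omega)
    return math.atan2(h.imag, h.real)


# Hypothetical toy filters: a symmetric 3-term average and a one-sided 2-term average.
symmetric = {-1: 1.0 / 3.0, 0: 1.0 / 3.0, 1: 1.0 / 3.0}
one_sided = {0: 0.5, 1: 0.5}

if __name__ == "__main__":
    for omega in (0.0, 0.1, 0.25, 0.4):
        print(omega,
              round(gain(symmetric, omega), 4), round(phase(symmetric, omega), 4),
              round(gain(one_sided, omega), 4), round(phase(one_sided, omega), 4))
```

The symmetric filter has a real frequency response, so its phase is zero or $\pm\pi$, while the one-sided filter introduces a phase shift that varies with frequency.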

Following Dagum (1982a,b) we introduce next three measures of filter 
revision based on the root mean square revision of different filters over all 
the frequencies. 

The first measure is:

$$R^{(\ell,0)} = \left[ 2 \int_0^{1/2} \| H^{(\ell)}(\omega) - H^{(0)}(\omega) \|^2 \, d\omega \right]^{1/2}, \qquad 0 \le \omega \le 1/2, \quad \ell = 1, 2, 3, \ldots, n, \qquad (2.5)$$

where $H^{(0)}(\omega)$ is the frequency response function of the concurrent seasonal
adjustment filter and $H^{(\ell)}(\omega)$ is the frequency response function of a filter
shifted $\ell$ periods with respect to the concurrent. Taking into consideration
that for monthly series the symmetric seasonal adjustment filter of X-11
can be well approximated with 7 years of data plus one (see Young, 1968;
Wallis, 1974) and similarly for X-11-ARIMA (see Dagum, 1983), the $H^{(0)}(\omega)$
corresponds to the filter applied to the last observation of a series consisting
of at least 85 data points. This filter becomes central or symmetric after
the series is extended with forty-two more observations and thus $H^{(42)}(\omega)$
denotes the frequency response function of the symmetric filter.

Equation (2.5) gives the time path of the concurrent filter as it
approaches the symmetric filter for $\ell = 1, 2, \ldots, 42$.

The second measure of filter revision to be used in this study refers to 
the differences between consecutive filters defined by 



$$R^{(k+1,k)} = \left[ 2 \int_0^{1/2} \| H^{(k+1)}(\omega) - H^{(k)}(\omega) \|^2 \, d\omega \right]^{1/2}, \qquad 0 \le \omega \le 1/2, \quad k = 0, 1, 2, \ldots, n-1. \qquad (2.6)$$

Equation (2.6) gives the time path of single-period revisions of the filters as
new observations enter into the series. In the case of monthly data to be
discussed in this study, $R^{(k+1,k)}$ gives the time path of the monthly revisions.

A third measure gives the time path of annual revisions and is defined
by:

$$R^{(k+12,k)} = \left[ 2 \int_0^{1/2} \| H^{(k+12)}(\omega) - H^{(k)}(\omega) \|^2 \, d\omega \right]^{1/2}, \qquad 0 \le \omega \le 1/2, \quad k = 0, 1, 2, 3, \ldots, n-12. \qquad (2.7)$$


Equations (2.6) and (2.7) are useful for assessing the frequency of revisions 
of the concurrent seasonal adjustment filter as new observations enter into 
the series. 
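
A root mean square measure of this kind can be approximated by numerical integration. The following is a minimal sketch using a midpoint rule. The two filters are hypothetical toy examples rather than X-11 weights, and the exponent sign in `frequency_response` is an assumed convention that does not affect the result for real weights. For real weights, Parseval's relation gives the same quantity as the root of the sum of squared weight differences, which offers a convenient check.

```python
import cmath
import math


def frequency_response(weights, omega):
    """H(omega) for weights given as {lag j: h_j}, where h_j multiplies x_{t-j}."""
    return sum(h * cmath.exp(-2j * math.pi * omega * j) for j, h in weights.items())


def rms_revision(w_a, w_b, steps=2000):
    """[2 * integral over [0, 1/2] of |H_a - H_b|^2]^(1/2), by the midpoint rule."""
    width = 0.5 / steps
    total = 0.0
    for s in range(steps):
        omega = (s + 0.5) * width
        diff = frequency_response(w_a, omega) - frequency_response(w_b, omega)
        total += abs(diff) ** 2 * width
    return math.sqrt(2.0 * total)


def rms_revision_weights(w_a, w_b):
    """The same measure computed in the time domain from the weight differences."""
    lags = set(w_a) | set(w_b)
    return math.sqrt(sum((w_a.get(j, 0.0) - w_b.get(j, 0.0)) ** 2 for j in lags))


# Hypothetical filters: a one-sided "concurrent" average and a symmetric "final" average.
concurrent = {0: 0.5, 1: 0.5}
final = {-1: 1.0 / 3.0, 0: 1.0 / 3.0, 1: 1.0 / 3.0}

if __name__ == "__main__":
    print(rms_revision(concurrent, final), rms_revision_weights(concurrent, final))
```

Restricting the range of the integration loop to a sub-band of frequencies gives a band-specific measure.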



3. TIME PATH OF THE CONCURRENT SEASONAL ADJUSTMENT
FILTERS OF X-11 AND X-11-ARIMA

The $R^{(\ell,0)}$, $\ell = 1, 2, \ldots, 42$, measure given in equation (2.5) has been
calculated for the X-11 and X-11-ARIMA concurrent filters. The ARIMA
extrapolation model applied is the classical $(0,1,1)(0,1,1)_{12}$ IMA type (Box
and Jenkins, 1970) of the following form:

$$(1 - B)(1 - B^{12}) x_t = (1 - \theta B)(1 - \Theta B^{12}) a_t. \qquad (3.1)$$

Since the extrapolations affect significantly the concurrent filter depending
on the parameter values of $\theta$ and $\Theta$, we selected some combinations of values
often found when modelling economic time series. These are:

    θ = 0.40  Θ = 0.40      θ = 0.60  Θ = 0.40      θ = 0.80  Θ = 0.40
    θ = 0.40  Θ = 0.60      θ = 0.60  Θ = 0.60      θ = 0.80  Θ = 0.60
    θ = 0.40  Θ = 0.80      θ = 0.60  Θ = 0.80      θ = 0.80  Θ = 0.80

The smaller the value of $\theta$, the more flexible or changing the trend-cycle
component is assumed to be. Similarly, the smaller the value of $\Theta$, the more
flexible or moving the seasonal component is assumed to be.
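
To see what series of this type look like, model (3.1) can be simulated from its difference-equation form $x_t = x_{t-1} + x_{t-12} - x_{t-13} + a_t - \theta a_{t-1} - \Theta a_{t-12} + \theta\Theta a_{t-13}$. The following is a minimal sketch. The zero start-up values, the function name and the parameter name `big_theta` (standing for $\Theta$) are assumptions made for illustration.

```python
import random


def simulate_airline(n, theta, big_theta, sigma=1.0, seed=0):
    """Simulate n values from (1 - B)(1 - B^12) x_t = (1 - theta B)(1 - big_theta B^12) a_t."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, sigma) for _ in range(n + 13)]  # innovations, incl. start-up
    x = [0.0] * 13  # zero start-up values (an assumption of this sketch)
    for t in range(13, n + 13):
        # moving-average part of the model
        w = a[t] - theta * a[t - 1] - big_theta * a[t - 12] + theta * big_theta * a[t - 13]
        # undo the regular and seasonal differencing
        x.append(x[t - 1] + x[t - 12] - x[t - 13] + w)
    return x[13:]


if __name__ == "__main__":
    series = simulate_airline(48, theta=0.40, big_theta=0.60)
    print([round(v, 2) for v in series[:12]])
```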

Table 1 gives a summary of the values of: (1) the total revisions of the
concurrent filters, $R^{(42,0)}$; (2) the revisions of the concurrent filters after a
year of observations has been incorporated into the series, $R^{(12,0)}$; and (3)
the revisions after 13 months, i.e., $R^{(13,0)}$.

We note from Table 1 that the total revisions of the concurrent filters
are always smaller if ARIMA extrapolations are used. These observations
conform to those given by Dagum (1982a, b) although these earlier studies
referred only to the revisions of the seasonal frequency bands whereas here
we are analysing the revisions over all the frequencies. The root mean square
total revision reduction ranges from 20% for Θ = .80 to 6% for Θ = .40.

Table 1. Root Mean Square Revisions Over All Frequencies
of the Concurrent Seasonal Adjustment Filter for Selected Month-Spans

                             Total       First Year   13 month-span
X-11-ARIMA                   Revisions   Revision     Revision
Method                       R(42,0)     R(12,0)      R(13,0)

Without
Extrapolations (X-11)          .36         .29          .30

With
Extrapolations from
Model (0,1,1)(0,1,1)
(Parameter Values)
θ = .40   Θ = .40              .34         .30          .34
θ = .40   Θ = .60              .32         .27          .31
θ = .40   Θ = .80              .30         .25          .27
θ = .60   Θ = .40              .34         .32          .34
θ = .60   Θ = .60              .32         .28          .30
θ = .60   Θ = .80              .30         .25          .26
θ = .80   Θ = .40              .34         .33          .34
θ = .80   Θ = .60              .32         .29          .30
θ = .80   Θ = .80              .30         .26          .26

Second, the speed of convergence of the concurrent seasonal adjustment
filter to the symmetric filter is faster for X-11-ARIMA with extrapolations
than without extrapolations (X-11). After 13 months, $R^{(13,0)}$ represents
between 88% and 100% of the total revision $R^{(42,0)}$ depending on the values
of Θ. For Θ = .40, which implies a fast moving seasonality, the total revision
is completed after the first year whereas for Θ = .80, which corresponds to
a more rigid or stable seasonal pattern, only 88% of the total revision is
corrected during the same period.

It is important to point out here that the revisions from the concurrent
filter to the 12 month lag filter are not monotonic for each frequency
$\omega$. In fact, although $R^{(12,0)} < R^{(42,0)}$ over all the $\omega$'s, the reverse occurs
for the revisions associated with the frequencies $\omega$ that fall between 0 and
0.050 which generally are attributed to trend and cyclical variations. This
remark agrees with that of Burridge and Wallis (1984), who showed the
presence of inconsistencies between the concurrent and the 12 month lag
filters. Table 2 shows the revision measures $R^{(12,0)}$, $R^{(24,0)}$ and $R^{(42,0)}$ for
two frequency bands, namely, $0 \le \omega \le 0.050$ and $0.050 < \omega \le 0.50$ which
are often attributed to the trend-cycle and seasonal plus irregular variations
respectively.



Table 2. Root Mean Square Revisions of the Concurrent Seasonal
Adjustment Filter for Selected Frequency Bands and Selected Month-Spans

                           12 month-span    24 month-span    Total
                           Revisions        Revisions        Revisions
X-11-ARIMA                 R(12,0)          R(24,0)          R(42,0)
Method                     Tr.     Ir.      Tr.     Ir.      Tr.     Ir.

Without
Extrapolations (X-11)      .09     .31      .05     .39      .05     .38

With
Extrapolations from
Model (0,1,1)(0,1,1)
(Parameter Values)
θ = .40   Θ = .40          .24     .31      .14     .36      .15     .36
θ = .40   Θ = .60          .21     .27      .12     .34      .12     .34
θ = .40   Θ = .80          .18     .24      .11     .32      .11     .32
θ = .60   Θ = .40          .28     .32      .17     .37      .17     .36
θ = .60   Θ = .60          .24     .28      .15     .35      .15     .34
θ = .60   Θ = .80          .21     .25      .13     .32      .13     .32
θ = .80   Θ = .40          .34     .33      .20     .36      .20     .35
θ = .80   Θ = .60          .29     .29      .18     .34      .18     .34
θ = .80   Θ = .80          .26     .26      .16     .32      .16     .32

Tr. = Trend-cycle (0 ≤ ω ≤ .05)
Ir. = Irregular (.05 < ω ≤ .5)

We can see that the revisions of the low frequency band are larger after 12
months than when 24 or 42 months have been added to the series. Although
not shown, smaller discrepancies are also observed for and 

These discrepancies disappear for most cases after 16 months and, in all
cases, after 24 months where the concurrent filter revisions are equivalent 
to those obtained from the final filter. These results imply that second year 
revisions would suffice from the viewpoint of filter changes. 

We also observe larger revisions of the frequencies attributed to trend 
and cyclical variations when the ARIMA extrapolations are used. In prac- 
tice, however, these revisions would not be observed for they correspond 
mainly to short cycles (3 years or less) which would not be present if the
series is well represented by an IMA model of the $(0,1,1)(0,1,1)_{12}$ type used
for the extrapolations.

Summarizing the above observations, the time path of the various con- 
current filters shows that the use of ARIMA extrapolations is highly ben- 
eficial from the viewpoints of: (1) the size of the total revisions, which is 
significantly decreased; and (2) the period of time required for the concur- 
rent filter to converge to the final symmetric filter, which is also significantly 
decreased. 



4. TIME PATH OF MONTHLY AND ANNUAL REVISIONS OF THE
SEASONAL ADJUSTMENT ASYMMETRIC FILTERS OF
X-11 AND X-11-ARIMA

4.1 Monthly Revisions

The measure $R^{(k+1,k)}$ of equation (2.6) has been calculated for the
X-11-ARIMA concurrent filters with extrapolations from the ARIMA models
discussed in the previous section and for X-11-ARIMA without
extrapolations (X-11). The monthly revisions $R^{(k+1,k)}$ also measure the distance
between consecutive asymmetric filters.

One important set of single-period revisions is that corresponding to
the first year; that is, for time lags $\ell = k + 1 = 1, 2, \ldots, 11$. These eleven
monthly revisions should improve the seasonal adjustment filter because of
the improvement in the weight system of the 13-term Henderson trend-cycle 
filter which becomes symmetric after six observations have been added to 
the series. We will discuss later whether it is advisable or not to revise 
eleven times the concurrent filter in order to improve the seasonally adjusted 
estimates. 

Another set of consecutive filter revisions of interest includes $\ell = 12$
and $\ell = 24$. These revisions reflect the improvement in the 3x5 (7-term)
seasonal moving average weights which change from year to year (being
constant within the year) until they become symmetric (after three years).
Finally, the revisions at $\ell = 13$ and $\ell = 25$ are important because they are
due to the fact that in the X-11-ARIMA the seasonal estimates are forced
to sum to zero (12) over each calendar year if an additive (multiplicative)
decomposition model is applied.

The observations drawn from Table 3 can be summarized as follows.
First, the pattern of monthly revisions is the same whether ARIMA
extrapolations are used or not. This pattern is characterized by a rapid decrease
in the monthly revisions for $\ell = 1, 2$, and 3, and a slow decrease thereafter until $\ell = 11$;
then a large increase (reversal of direction) occurs at $\ell = 12$ followed by a
rapid decrease for $\ell = 13$, then another large increase at $\ell = 24$ followed by
a rapid decrease at $\ell = 25$.



Table 3. Monthly Root Mean Square Revisions,
Over All Frequencies of the Concurrent and Asymmetric Filters
of X-11-ARIMA With and Without Extrapolations (X-11)

                      With Extrapolations from Model (0,1,1)(0,1,1)
                      and selected parameter values
Monthly    Without
Revisions  Extrapo-   θ=.40  θ=.40  θ=.40  θ=.60  θ=.60  θ=.60  θ=.80  θ=.80  θ=.80
ℓ = k+1    lations    Θ=.40  Θ=.60  Θ=.80  Θ=.40  Θ=.60  Θ=.80  Θ=.40  Θ=.60  Θ=.80

    1       0.122     0.176  0.148  0.123  0.136  0.115  0.090  0.100  0.079  0.063
    2       0.066     0.130  0.113  0.098  0.102  0.087  0.074  0.078  0.065  0.054
    3       0.024     0.089  0.081  0.073  0.071  0.065  0.058  0.056  0.051  0.045
    4       0.022     0.056  0.054  0.051  0.048  0.046  0.044  0.041  0.039  0.038
    5       0.037     0.034  0.033  0.033  0.033  0.033  0.033  0.033  0.032  0.033
    6       0.041     0.022  0.018  0.018  0.024  0.021  0.021  0.026  0.024  0.024
    7       0.033     0.019  0.015  0.014  0.016  0.013  0.012  0.018  0.015  0.015
    8       0.018     0.022  0.019  0.018  0.014  0.009  0.008  0.009  0.009  0.008
    9       0.014     0.029  0.027  0.026  0.012  0.011  0.010  0.007  0.007  0.006
   10       0.025     0.041  0.038  0.037  0.019  0.017  0.016  0.013  0.011  0.010
   11       0.030     0.057  0.053  0.051  0.026  0.024  0.023  0.015  0.012  0.012
   12       0.210     0.259  0.228  0.199  0.277  0.245  0.215  0.293  0.262  0.232
   13       0.108     0.136  0.118  0.101  0.104  0.088  0.073  0.075  0.062  0.050
   14       0.059     0.103  0.091  0.080  0.080  0.069  0.060  0.060  0.051  0.043
   15       0.022     0.071  0.065  0.059  0.057  0.052  0.047  0.045  0.040  0.036
   16       0.019     0.046  0.043  0.041  0.038  0.037  0.035  0.033  0.031  0.025
   24       0.122     0.150  0.134  0.119  0.161  0.145  0.129  0.172  0.155  0.139
   25       0.066     0.079  0.069  0.060  0.058  0.050  0.042  0.040  0.034  0.028
   26       0.036     0.063  0.057  0.050  0.048  0.042  0.037  0.030  0.030  0.026
   36       0.054     0.062  0.058  0.054  0.067  0.062  0.059  0.072  0.067  0.063
   37       0.031     0.032  0.029  0.027  0.021  0.020  0.019  0.013  0.012  0.011

The significant decreases for the first three consecutive revisions are due
to the improvement of the Henderson filter weights. What looks like a
reversal of direction in the size of the filter revisions at $\ell = 12$ and $\ell = 13$ is due
to the improvement in the seasonal weights which become less asymmetric
from year to year until three full years are added to the series.

Second, the effect of the ARIMA extrapolations can be observed in the
monthly revisions during the first year, particularly at $\ell = 1$, 2 and 3 where
the revisions tend to be larger for small θ and Θ whereas the opposite occurs
for large θ and Θ.

Third, we note that the consecutive single-period revisions do not
decrease monotonically within the year. Although the revision values are very
small for $\ell \ge 4$, there are reversals of direction at $\ell = 5$, 6 and 10 when
no extrapolations are used. There is only one reversal and at a later lag if
ARIMA extrapolations are used; this occurs at $\ell = 7$ for θ = .40 and at
$\ell = 10$ for θ = .60 and θ = .80. This pattern repeats for the second year
after a large jump at $\ell = 12$ and again during the third year after lag $\ell = 24$.

Since the monthly revisions during the first year are not monotonically
decreasing, it is not advisable to revise every time a new observation enters
into the series. Revising the concurrent filter eleven times will introduce
unwanted revisions because the distance between consecutive asymmetric
filters does not decrease monotonically as the lags of the filters approach
$\ell = 12$. Although not shown here, this inconsistency of the distance
between asymmetric filters is mainly due to the phase angle of the filters
and particularly affects the high frequencies $\omega$ associated with the
irregulars. This type of inconsistency in the distance between consecutive
asymmetric filters does not imply, however, that the time path of the total
distance of each asymmetric filter with respect to the final filter is
inconsistent. In fact, the total distances of the asymmetric filters,
$k = 0, 1, 2, \ldots, 41$, decrease monotonically with increasing $k$.

Finally, we observe that the two largest single-period revisions occur at
$\ell = 1$ and $\ell = 12$.

4.2 Annual Revisions 

The measure of equation (2.7) has been calculated for the X-11-ARIMA
asymmetric filters with and without ARIMA extrapolations.

Table 4 shows that for each asymmetric filter, $k = 0, 1, 2, \ldots, 30$, its
corresponding annual revision converges monotonically and very fast to zero.
For example, the revision of the concurrent filter of X-11 after 12 months
(first annual revision) is 0.29, its second annual revision is 0.20 and its
third annual revision is 0.11. A similar pattern is followed by the remaining
asymmetric filters. This monotonic convergence holds for the root mean square
revisions over all frequencies. For the band of low frequencies associated
with cyclical variations this monotonic convergence is not observed at
$\ell = 12$, as discussed in Section 3.

Table 4. Annual Root Mean Square Revisions, Over All Frequencies, of the
Concurrent and Asymmetric Filters of X-11-ARIMA With and Without ARIMA
Extrapolations (X-11)

Annual      Without         With Extrapolations from Model (0,1,1)(0,1,1)
Revisions   Extrapolations  and selected parameter values
                            θ = .40              θ = .60              θ = .80
ℓ = k + 12                  Θ=.40  Θ=.60  Θ=.80  Θ=.40  Θ=.60  Θ=.80  Θ=.40  Θ=.60  Θ=.80

12          0.29            0.30   0.27   0.23   0.31   0.28   0.25   0.33   0.29   0.26
13          0.27            0.35   0.31   0.27   0.34   0.30   0.26   0.34   0.30   0.26
14          0.26            0.35   0.31   0.26   0.34   0.30   0.26   0.34   0.30   0.26
15          0.26            0.34   0.30   0.26   0.34   0.30   0.26   0.34   0.30   0.26
16          0.26            0.33   0.30   0.26   0.33   0.29   0.26   0.33   0.29   0.26
17          0.26            0.33   0.29   0.26   0.33   0.29   0.26   0.33   0.29   0.26
18          0.28            0.34   0.30   0.26   0.34   0.30   0.26   0.34   0.30   0.26
19          0.27            0.34   0.30   0.26   0.34   0.30   0.26   0.34   0.30   0.26
20          0.27            0.34   0.30   0.26   0.34   0.30   0.26   0.34   0.30   0.26
21          0.27            0.34   0.30   0.26   0.34   0.30   0.26   0.34   0.30   0.26
22          0.27            0.34   0.30   0.26   0.34   0.30   0.26   0.34   0.30   0.26
23          0.27            0.34   0.31   0.26   0.34   0.30   0.26   0.34   0.30   0.26
24          0.20            0.19   0.17   0.15   0.20   0.18   0.16   0.20   0.18   0.16
25          0.18            0.22   0.19   0.17   0.20   0.18   0.16   0.20   0.18   0.16
26          0.16            0.21   0.19   0.17   0.20   0.18   0.16   0.20   0.18   0.16
27          0.16            0.20   0.18   0.16   0.20   0.18   0.16   0.19   0.17   0.16
28          0.16            0.20   0.17   0.16   0.19   0.18   0.16   0.20   0.17   0.16
29          0.16            0.19   0.17   0.16   0.19   0.18   0.16   0.20   0.17   0.16
30          0.16            0.20   0.17   0.16   0.20   0.18   0.15   0.20   0.17   0.15
31          0.17            0.20   0.17   0.16   0.20   0.18   0.15   0.20   0.17   0.15
32          0.16            0.20   0.17   0.16   0.20   0.18   0.15   0.20   0.17   0.15
33          0.16            0.20   0.18   0.16   0.20   0.18   0.16   0.20   0.17   0.15
34          0.16            0.20   0.18   0.16   0.20   0.18   0.16   0.20   0.17   0.15
35          0.16            0.20   0.18   0.16   0.20   0.18   0.16   0.20   0.17   0.15
36          0.11            0.10   0.09   0.08   0.09   0.09   0.08   0.09   0.09   0.08
37          0.09            0.10   0.09   0.08   0.09   0.09   0.08   0.09   0.08   0.08
38          0.08            0.10   0.09   0.08   0.09   0.09   0.08   0.09   0.08   0.08
39          0.07            0.09   0.08   0.08   0.09   0.09   0.08   0.09   0.08   0.08
40          0.07            0.08   0.07   0.07   0.08   0.08   0.07   0.08   0.07   0.07
41          0.07            0.08   0.07   0.07   0.08   0.08   0.07   0.08   0.07   0.07
42          0.07            0.08   0.07   0.07   0.08   0.08   0.07   0.08   0.07   0.07

Table 4 also shows that the size of the annual revisions is rather constant
for consecutive filters within each year; that is, from $\ell = 12$ to
$\ell = 23$, $\ell = 24$ to $\ell = 35$ and $\ell = 37$ to $\ell = 42$. This
pattern of annual revisions implies that all changes observed in
month-to-month comparisons within the same year are attributed mainly to the
innovations entering into the series.

However, the most common practice of revising current seasonally adjusted
data consists of keeping the concurrent estimate constant from the time it
appears until the end of the year and then revising annually the current and
earlier years, generally up to three. Consequently, first-year revisions are
given by . . ., second-year revisions by . . . and third-year revisions by
the measures with superscripts $(25,13)$, $(26,14)$, . . . . Table 4 shows
the second- and third-year revisions and Table 5 shows the first-year
revisions. Given that the latter are generally the most relevant for decision
making, the revisions are shown for two frequency bands and for all
frequencies. Since the pattern is similar when using ARIMA extrapolation,
only two of the cases are shown.



Table 5. First Year Root Mean Square Revisions of the Concurrent Seasonal
Adjustment Filter for Selected Frequency Bands and Over All Frequencies

        Without                 With Extrapolations
        Extrapolation           θ = .40, Θ = .80        θ = .80, Θ = .80
ℓ       Tr.    Ir.    Tot.      Tr.    Ir.    Tot.      Tr.    Ir.    Tot.

1       .02    .13    .12       .04    .13    .12       .05    .06    .06
2       .03    .14    .13       .07    .13    .13       .08    .08    .08
3       .04    .14    .13       .09    .13    .13       .11    .08    .09
4       .04    .14    .13       .10    .13    .13       .13    .08    .09
5       .04    .16    .15       .11    .13    .13       .14    .08    .09
6       .04    .18    .17       .11    .13    .13       .15    .08    .09
7       .05    .17    .16       .11    .13    .13       .15    .08    .09
8       .05    .17    .16       .11    .13    .13       .15    .08    .09
9       .05    .17    .16       .11    .13    .13       .15    .08    .09
10      .05    .17    .16       .11    .14    .14       .15    .08    .09
11      .05    .17    .16       .11    .14    .14       .15    .08    .09

Tr.  = Trend-cycle (0 < ω < .05)
Ir.  = Irregular (.05 < ω < .5)
Tot. = Total (0 < ω < .5)

We can observe that there is a monotonic increase in the revisions of the
low frequencies from $\ell = 1$ to $\ell = 3$ or 4, and thereafter they
remain rather constant. On the other hand, the revision of the high
frequencies seems to be rather constant for $\ell > 2$ if ARIMA
extrapolations are used and for $\ell > 7$ if no extrapolations are made.
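
As a rough consistency check on Table 5, the total revision can be compared
with the two band revisions. The sketch below assumes (this is not stated in
the text) that each root mean square averages squared revisions uniformly
over frequency, so that the trend-cycle band carries weight 0.1 and the
irregular band weight 0.9. The function name `combine_bands` is illustrative.

```python
import math

def combine_bands(tr, ir, w_tr=0.1, w_ir=0.9):
    """Total RMS implied by two band RMS values.

    Assumes uniform weighting over frequency: the band (0, .05) is 10% and
    the band (.05, .5) is 90% of the range (0, .5). This weighting is an
    assumption made for illustration, not a formula quoted from the paper.
    """
    return math.sqrt(w_tr * tr ** 2 + w_ir * ir ** 2)

# (Tr., Ir., Tot.) triples read from Table 5, rows 1, 6 and 11,
# for the filter without extrapolations.
rows = [(0.02, 0.13, 0.12), (0.04, 0.18, 0.17), (0.05, 0.17, 0.16)]
for tr, ir, tot in rows:
    implied = combine_bands(tr, ir)
    print(f"Tr={tr:.2f} Ir={ir:.2f} tabulated Tot={tot:.2f} implied Tot={implied:.3f}")
```

Under this assumption the implied totals agree with the tabulated ones to
within the two-decimal rounding of the table, which is consistent with the
total staying close to the irregular-band value.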

The advantage of this common scheme of revisions is that when doing
month-to-month comparisons, all changes within the first year are due mainly
to the innovations, since the filter applied during the current year is
always the same, i.e., the concurrent seasonal adjustment filter.
Furthermore, the filters of the previous years are modified by almost the
same amount within each year, the exceptions being $\ell = 1$, 2, 12 and 13
in most cases.

4.3 Combining Monthly and Annual Revisions 

The results discussed in Sections 4.1 and 4.2 suggest that a better scheme
of revisions than the common practice should include monthly as well as
annual revisions, since the largest single-period revisions occur at
$\ell = 1$ and 12. It is expected that (1) adjusting concurrently each month,
say from January to November, and revising only once when the next month is
available, and (2) adjusting concurrently December when it first appears and
then revising the first year and earlier years when January is added, should
improve the reliability of the filter applied during the current year while
simultaneously maintaining the filter's homogeneity for month-to-month
comparisons.

The first-year revisions would then be . . . . These revisions are shown in
Table 6.

We can see that the pattern is very similar to that for the concurrent
filter, but the size of the revisions is smaller in all cases, which agrees
with our expectations.

This scheme of combining monthly and annual revisions has been shown 
to produce smoother seasonally adjusted series and smaller revisions when 
applied to real data (Kenny and Durbin, 1982; Dagum and Morry, 1984). 

Table 6. First-Year Root Mean Square Revisions of the First-Month Revised
Seasonal Adjustment Filter for Selected Bands and Over All Frequencies

        Without                 With Extrapolations
        Extrapolation           θ = .40, Θ = .80        θ = .80, Θ = .80
ℓ       Tr.    Ir.    Tot.      Tr.    Ir.    Tot.      Tr.    Ir.    Tot.

2       .01    .07    .07       .03    .10    .10       .04    .06    .06
3       .02    .07    .07       .05    .10    .10       .07    .07    .07
4       .02    .07    .07       .07    .10    .10       .09    .07    .07
5       .02    .08    .08       .07    .10    .10       .11    .07    .08
6       .03    .10    .10       .08    .11    .11       .11    .07    .08
7       .03    .11    .11       .08    .11    .11       .12    .07    .08
8       .03    .11    .11       .08    .11    .11       .12    .07    .08
9       .03    .11    .11       .08    .11    .11       .12    .07    .08
10      .03    .12    .11       .08    .11    .11       .12    .07    .08
11      .03    .12    .12       .09    .12    .12       .12    .07    .08

Tr.  = Trend-cycle (0 < ω < .05)
Ir.  = Irregular (.05 < ω < .5)
Tot. = Total (0 < ω < .5)

5. CONCLUSIONS

This study has addressed the problem of how often the concurrent seasonal
adjustment filter of X-11-ARIMA, with and without extrapolations, should be
revised. It is shown that:

(I) The time path of the concurrent filter for consecutive month-spans,
$\ell = 1, 2, \ldots, 41$, approaches the final symmetric filter
($\ell = 42$) nearly monotonically. The use of the ARIMA extrapolation option
decreases the size of the total revision while increasing the concurrent
filter's speed of convergence to the final symmetric filter. However, it has
been observed that the revisions of the low frequencies are larger after 12
months than when 24 or 42 have been added to the series. This inconsistency
was already noted by Burridge and Wallis (1984) when fitting ARIMA models to
the transfer functions of the concurrent and first-year revised filters. This
inconsistency disappears for most cases after 16 months and, in all cases,
the revisions of the concurrent filter are equivalent to those obtained from
the final filter after 24 months have been added to the series. These results
imply that second-year revisions should suffice from the viewpoint of filter
changes.

(II) The monthly revisions of the concurrent filter do not approach either
the annual or the final filters monotonically. The larger single-period
revisions occur at $\ell = 1, 2, 3, 12, 13, 24$, and 25. There are
significant decreases for the first three consecutive revisions due to the
improvement of the end weights of the Henderson trend-cycle filter. There is
a reversal of direction in the size of the filter revision at $\ell = 12$ and
$\ell = 24$ due to an improvement in the seasonal weights, which become less
asymmetric from year to year until three full years are added to the series.
There are two large decreases at $\ell = 13$ and $\ell = 25$, which are due
to the fact that the seasonal estimates are forced to add to zero (12) over
each calendar year if an additive (multiplicative) decomposition model is
assumed.

The annual revisions of the concurrent filter and remaining monthly
asymmetric filters of the first year approach the final filter monotonically
in root mean square over all the frequencies, but not for each frequency; in
particular, those frequencies associated with the trend-cycle component are
revised more as compared to the total revision (distance between concurrent
and final filter). It is also observed that the annual revisions are rather
constant for each filter within the same year.

Taking into consideration the patterns of monthly and annual revisions, 
the best combination of frequency of revision of the concurrent filter would 
be to revise when a new month appears, keep the estimate constant for the 
remainder of the year and then, revise annually when the first month of 
the next year is available. This scheme offers the following advantages: (1) 
by revising each month once, the reliability of the concurrent filter increases 
significantly and since the revised filter is kept constant during the first year, 
changes in the month-to-month comparisons are due only to the innovations; 
and (2) by revising annually, the reliability of the filters improves, while the 
comparability of consecutive filters is maintained since they are all revised 
by almost a constant amount (without introducing frequency distortions) 
within each year. This scheme has been shown to produce smoother seasonally
adjusted series and smaller revisions when applied to real data by Kenny
and Durbin (1982) and Dagum and Morry (1984).

A WALSH-FOURIER APPROACH TO THE
ANALYSIS OF BINARY TIME SERIES

ABSTRACT 

A nonparametric approach to analyzing a stationary binary time series
$\{X(n), n = 0, \pm 1, \pm 2, \ldots\}$ taking values in $\{0, 1\}$ is
discussed. The analysis is accomplished in the spectral domain using the
Walsh-Fourier transform, which is based on Walsh functions. This seems to be
a natural alternative to the trigonometric functions used in the usual
spectral analysis since the Walsh functions take on only two values, $+1$ or
$-1$ (or “on” and “off”, as does the series $X(n)$ itself). This approach
enables the investigator to analyze a binary series in terms of square-waves
and sequency (switches or changes per unit time) rather than sine-waves and
frequency (cycles per unit time). We discuss (1) the basic theory of
Walsh-Fourier analysis, (2) the computational aspects involved in calculating
the discrete Walsh-Fourier transform, and (3) the analysis of simulated and
real binary data in the sequency domain. We suggest that these methods would
enhance the analysis of time series which take values in a discrete finite
set.



1. INTRODUCTION 

Implicit in the spectral (Fourier) analysis of time series is one of two 
“extreme” assumptions about the process: (a) the very long stretch of the 
time series is the only time series we want to consider and consists of the 
superposition of not too many sinusoidal terms of substantially different 
frequencies; (b) the time series is to be regarded as a realization of an ergodic 
Gaussian process; it is one of many possible time series and the analyses are 
directed toward the properties of the ensemble of the series, not toward those 
of a specific realization (Brillinger and Tukey, 1982). 

^ Department of Mathematics and Statistics, University of Pittsburgh,
Pittsburgh, Pennsylvania 15260 (both authors)

However, there are many situations in which time series are patently
non-normal. Similarly, there are processes, such as those that take values
in a discrete finite set, which cannot be thought of as the superpositions of
well separated sinusoids. For the case of continuous valued non-normal time
series it is perhaps still reasonable, in some cases, to do spectral analysis
using Fourier (trigonometric) methods. However, in the cases where the time
series takes values in a discrete finite set, it makes little sense to
correlate the data with sines and cosines. As an alternative, we suggest that
the spectral analysis of discrete-valued time series be accomplished in the
“sequency” domain via the Walsh-Fourier transform (Ahmed and Rao, 1975; Kohn
1980a, 1980b; Morettin, 1981, 1983; Stoffer, 1985). This seems to be a
natural alternative to the usual Fourier analysis since the Walsh-Fourier
transform is based on the “square-wave” Walsh functions (Ahmed and Rao, 1975;
Fine, 1949, 1950, 1957; Morettin, 1974b, 1981, to mention a few). This
approach would enable investigators to analyze a discrete-valued time series
(which we may think of as a square-waveform) in terms of square-waves and
sequency (switches per unit time) rather than sine-waves and frequency. As
empirically demonstrated by Beauchamp (1975), “the respective roles of Walsh
and Fourier spectral analysis for discontinuous and smooth-varying signals
are clear. Where the signal is derived from sinusoidally-based waveforms,
Fourier analysis is relevant. Where the signal contains sharp discontinuities
and a limited number of levels, Walsh analysis is appropriate”.

The Walsh functions, which are defined via the Rademacher functions
(see Ahmed and Rao, 1975; Kohn, 1980a), form a complete orthonormal
sequence on $[0, 1)$ and take on only two values, $+1$ and $-1$ (or “on” and
“off”). They are ordered by the number of zero-crossings, which is called
sequency. If $W(n, \lambda)$, $n = 0, 1, 2, \ldots$, $0 \le \lambda < 1$,
denotes the $n$th Walsh function, then $W(n, \cdot)$ makes $n$
zero-crossings in $[0, 1)$.

The first eight sequency-ordered discrete Walsh functions $W(n, m/N)$,
$n, m = 0, 1, \ldots, 7$, corresponding to a sample of size $N = 2^3$, are
shown in Figure 1 in an $8 \times 8$ matrix. We note that other types of
orderings exist, for example Paley order and Hadamard order (Ahmed and Rao,
1975). However, sequency ordering is more natural in that it is comparable to
the frequency ordering of sines and cosines. We will discuss methods of
generating the Walsh functions in Section 3.

Walsh spectral analysis has been used for several purposes, primarily
in the engineering sciences, such as speech processing, word recognition,
image coding and transmission, filtering and multiplexing. See, for example,
the 1971 and 1973 Proceedings on the Applications of Walsh Functions,
Beauchamp (1975) and Harmuth (1972), to mention a few. Applications of
the Walsh-Fourier transform in statistics are scarce and we discuss a few
here. Robinson (1972) compared the Walsh-Fourier transform using a Parzen
kernel with the usual Fourier transform for first order Markov processes. Ott
and Kronmal (1976) used the Walsh transform in classification and prediction
problems for strictly stationary binary time series. Panchalingam (1985)
analyzed simulated and real binary time series in the sequency domain.

W(0,m/N) →    1   1   1   1   1   1   1   1
W(1,m/N) →    1   1   1   1  -1  -1  -1  -1
W(2,m/N) →    1   1  -1  -1  -1  -1   1   1
W(3,m/N) →    1   1  -1  -1   1   1  -1  -1
W(4,m/N) →    1  -1  -1   1   1  -1  -1   1
W(5,m/N) →    1  -1  -1   1  -1   1   1  -1
W(6,m/N) →    1  -1   1  -1  -1   1  -1   1
W(7,m/N) →    1  -1   1  -1   1  -1   1  -1

Figure 1. Sequency-ordered discrete Walsh functions for a sample of size
N = 8.

Theoretical results concerning the statistical application of Walsh
spectral analysis to stationary time series are relatively new and limited to
the works of Kohn (1980a, b), Morettin (1974a, 1981, 1983) and Stoffer
(1985). We note here that some work has been done by others on the
statistical aspects of Walsh spectral analysis for “dyadic” stationary time
series (see Morettin, 1981, for discussions and references). Although dyadic
time has some theoretical appeal, it is of little practical use. We,
therefore, concentrate on real time stationary processes.

2. THE WALSH-FOURIER TRANSFORM 

Our discussion will be restricted to univariate time series; the multivari- 
ate versions follow in an obvious way (see Kohn, 1980b, Section 3). 

Throughout this section, we suppose that $X(0), X(1), \ldots, X(N-1)$ is
a sample of length $N = 2^p$ ($p$ a positive integer) from a zero-mean,
weakly stationary time series, $X(n)$, with absolutely summable
autocovariance function, $\gamma(h)$, $h = 0, \pm 1, \pm 2, \ldots$. Let

$$d_N(\lambda) = N^{-1/2} \sum_{n=0}^{N-1} X(n) W(n, \lambda), \qquad 0 \le \lambda < 1,$$

be the finite Walsh-Fourier transform of the data. The logical covariance
(Kohn, 1980a) is defined to be

$$\tau(j) = N^{-1} \sum_{k=0}^{N-1} \gamma(j \oplus k - k),$$

where $j \oplus k$ means the dyadic addition of $j$ and $k$. It can then be
shown that the variance of $d_N(\lambda)$ is given by

$$\mathrm{Var}\{d_N(\lambda)\} = \sum_{j=0}^{N-1} \tau(j) W(j, \lambda). \tag{1}$$

Taking the limit ($N \to \infty$) in (1) we see that
$\mathrm{Var}\{d_N(\lambda)\} \to f(\lambda)$, where

$$f(\lambda) = \sum_{j=0}^{\infty} \tau(j) W(j, \lambda), \qquad 0 \le \lambda < 1, \tag{2}$$

is called the Walsh-Fourier spectrum of $X(n)$. We note that $f(\lambda)$
exists since the absolute summability of $\gamma(h)$ implies the absolute
summability of $\tau(j)$ (Kohn, 1980a, Lemma 3). Specifically, Kohn has shown
the existence of $f(\lambda)$ under the condition that

E (1 - ^) I I < 
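
The variance identity (1) can be checked numerically for a small sample. The
sketch below is illustrative only: it treats dyadic addition of integers as
bitwise exclusive-or, builds the sequency-ordered Walsh matrix by sorting the
rows of a Hadamard matrix, and uses an AR(1)-type autocovariance chosen purely
as an example. The names `walsh_sequency` and `logical_covariance` are
hypothetical helpers, not routines from the paper.

```python
import numpy as np

def walsh_sequency(p):
    """Sequency-ordered Walsh matrix of size 2**p; entry [n, m] is W(n, m/N)."""
    H = np.array([[1]])
    for _ in range(p):
        H = np.block([[H, H], [H, -H]])
    changes = (np.diff(H, axis=1) != 0).sum(axis=1)   # sign changes per row
    return H[np.argsort(changes)]

def logical_covariance(gamma, N):
    """tau(j) = N^{-1} * sum_k gamma((j XOR k) - k), for j = 0, ..., N-1."""
    k = np.arange(N)
    return np.array([np.mean(gamma((j ^ k) - k)) for j in range(N)])

p, phi = 4, 0.6                       # illustrative sample size exponent and parameter
N = 2 ** p
gamma = lambda h: phi ** np.abs(h) / (1 - phi ** 2)   # example autocovariance
W = walsh_sequency(p)

# Left side of (1): Var{d_N(m/N)} = N^{-1} w_m' Gamma w_m, Gamma Toeplitz.
idx = np.arange(N)
Gamma = gamma(idx[:, None] - idx[None, :])
var_direct = np.einsum("nm,nk,km->m", W, Gamma, W) / N

# Right side of (1): sum_j tau(j) W(j, m/N).
tau = logical_covariance(gamma, N)
var_logical = tau @ W

print(np.max(np.abs(var_direct - var_logical)))
```

The two sides agree up to floating-point error because the product of two
sequency-ordered Walsh functions is again a Walsh function whose index is the
dyadic sum of the two indices.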

If $X(0), \ldots, X(N-1)$ is a sample of length $N = 2^p$, the transform
$d_N(\lambda)$ is calculated for $\lambda = m/N$, $m = 0, \ldots, N-1$. It
can be shown that

$$W(n, m/N) = W(m, n/N), \qquad m, n = 0, 1, \ldots, N-1, \tag{3}$$

and hence the value of $\lambda$ in the discrete Walsh-Fourier transform
corresponds to sequency. As with the usual Fourier analysis, if the mean of
the series is unknown, the only sequency of the form $m/N$ at which the
transform cannot be evaluated is the zero sequency ($m = 0$). To see this,
let $\mu = E[X(n)]$, all $n$, and note that for $m = 0, 1, \ldots, N-1$,

$$\sum_{n=0}^{N-1} W(n, m/N) = \begin{cases} N & \text{if } m = 0 \\ 0 & \text{if } m \ne 0. \end{cases} \tag{4}$$

Relationships (3) and (4) are given by Kohn (1980a, Lemma 1). It is clear
from (4) that the mean centered transform will be the uncentered transform
except at $m = 0$, and in particular

$$E\{d_N(m/N)\} = N^{-1/2} \sum_{n=0}^{N-1} \mu W(n, m/N) = \begin{cases} N^{1/2}\mu & \text{if } m = 0 \\ 0 & \text{if } m \ne 0, \end{cases} \qquad m = 0, 1, \ldots, N-1.$$
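
The effect of mean centering can be seen directly on a small sample. This is
a minimal sketch, assuming the sequency-ordered Walsh matrix is obtained by
sorting Hadamard rows by their number of sign changes; `wf_transform` is an
illustrative name and the data vector is arbitrary.

```python
import numpy as np

def walsh_sequency(p):
    """Sequency-ordered Walsh matrix of size 2**p; entry [n, m] is W(n, m/N)."""
    H = np.array([[1]])
    for _ in range(p):
        H = np.block([[H, H], [H, -H]])
    changes = (np.diff(H, axis=1) != 0).sum(axis=1)
    return H[np.argsort(changes)]

def wf_transform(x):
    """Finite Walsh-Fourier transform d_N(m/N), m = 0..N-1, for len(x) = 2**p."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    p = N.bit_length() - 1
    if 2 ** p != N:
        raise ValueError("sample length must be a power of 2")
    return walsh_sequency(p) @ x / np.sqrt(N)

x = np.array([1, 0, 0, 1, 1, 1, 0, 1])   # arbitrary binary sample, N = 8
d_raw = wf_transform(x)                   # uncentered transform
d_centered = wf_transform(x - x.mean())   # transform of the mean-corrected data

print(np.round(d_raw, 4))
print(np.round(d_centered, 4))
```

Only the ordinate at the zero sequency changes: it equals the square root of
N times the sample mean for the raw data and zero after centering.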
Various authors have established central limit theorems for the finite
Walsh-Fourier transform, $d_N(m/N)$, under a wide range of conditions
(Kohn, 1980a; Morettin, 1983; Stoffer, 1985). We state the version given
by Morettin (1983).

Theorem 2.1. Let $X(n)$ be a zero-mean strictly stationary time series
with finite moments and let $C_r(j_1, \ldots, j_r) =
\mathrm{cum}\{X(j_1), \ldots, X(j_r)\}$ be the $r$th cumulant of $X(n)$,
$n = 0, \pm 1, \pm 2, \ldots$. Further, suppose
$\sum_{j_1=0}^{\infty} \cdots \sum_{j_{r-1}=0}^{\infty}
|C_r(j_1, \ldots, j_{r-1}, 0)| < \infty$. Then $d_N(\lambda)$ converges in
distribution to a normal variate with zero mean and variance $f(\lambda)$
given by (2).

In order to be able to estimate consistently the Walsh-Fourier spectral
density, we shall need asymptotic results for smoothing the transform
$d_N(\lambda)$. The following lemma is given by Kohn (1980a, Corollary 3).

Lemma 2.1. Let $X(n)$ be a strictly stationary zero-mean time series with
absolutely summable autocovariance function, and suppose $\lambda_N$ and
$\mu_N$ are dyadically rational (i.e., their binary representations are
finite).

(1) If $\lambda = \mu$, $|\lambda_N - \mu_N| \ge N^{-1}$,
$\lambda_N \oplus \lambda \to 0$ and $\mu_N \oplus \mu \to 0$ as
$N = 2^p \to \infty$, then $E\{d_N(\lambda_N) d_N(\mu_N)\} \to 0$.

(2) If $\lambda_N \oplus \lambda \to 0$, then
$E\{d_N^2(\lambda_N)\} \to f(\lambda)$ as $N \to \infty$.

In particular, Theorem 2.1 and Lemma 2.1 give us the following useful
result for estimating the Walsh-Fourier spectral density.

Corollary 2.1. Let $\lambda_{j,N} = j/N$, $1 \le j \le N-1$, and suppose for
$\{\lambda_{j(1),N}, \ldots, \lambda_{j(M),N}\}$,
$\lambda_{j(k),N} \oplus \lambda \to 0$ as $N \to \infty$,
$k = 1, \ldots, M$, and
$|\lambda_{j(\ell),N} - \lambda_{j(k),N}| \ge N^{-1}$ for $\ell \ne k$,
$\ell, k = 1, \ldots, M$. Then $\mathbf{d}_N \xrightarrow{D} N(0, \Lambda)$,
where $\mathbf{d}_N = (d_N(\lambda_{j(1),N}), \ldots, d_N(\lambda_{j(M),N}))'$
and $\Lambda$ is an $M \times M$ diagonal matrix with $f(\lambda)$ along the
diagonal. Also, $\mathbf{d}_N'\mathbf{d}_N \xrightarrow{D} f(\lambda)\chi^2_M$,
so that $M^{-1}\mathbf{d}_N'\mathbf{d}_N$ is an estimate of $f(\lambda)$
having variance $2f^2(\lambda)/M$.

In particular, if we let $M \to \infty$ as $N \to \infty$ with $M/N \to 0$
in Corollary 2.1, the smoothed periodogram $M^{-1}\mathbf{d}_N'\mathbf{d}_N$
is a mean square consistent estimate of the Walsh-Fourier spectrum
$f(\lambda)$.
3. COMPUTATION OF THE WALSH-FOURIER TRANSFORM

The Walsh functions are usually calculated via the Hadamard matrix which,
for a sample of size $N = 2^p$ ($p$ a positive integer), is defined to be a
symmetric orthogonal $N \times N$ matrix whose $(u, v)$th element,
$u, v = 0, 1, \ldots, N-1$, is equal to $(-1)^{\sum_{i=1}^{p} u_i v_i}$,
where the binary representations of $u$ and $v$ are given by
$(u_1, \ldots, u_p)$ and $(v_1, \ldots, v_p)$, respectively, with
$u_i, v_i = 0$ or 1. The Hadamard matrix can be generated recursively by
$H(0) = 1$, and

$$H(k+1) = \begin{bmatrix} H(k) & H(k) \\ H(k) & -H(k) \end{bmatrix}, \qquad k = 0, 1, 2, \ldots.$$

For example, for a sample of size $N = 4$ we would calculate

$$H(1) = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \quad \text{and} \quad H(2) = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{bmatrix}.$$


The Hadamard matrix gives the Walsh functions as rows (or columns) in
what is called natural or Hadamard ordering. To obtain the Walsh functions
in sequency ordering we simply reorder the rows of $H(\cdot)$ according to
the number of sign changes. Another method uses “bit-reversal Gray code”
(Ahmed and Rao, 1975) to rearrange the rows of $H(\cdot)$. These methods,
however, are not very efficient.
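
The recursion and the reordering by sign changes can be sketched in a few
lines. This is an illustration of the slow, definition-based route described
above, not the fast method; the function names are chosen here for
convenience.

```python
import numpy as np

def hadamard(p):
    """Hadamard matrix H(p) of size 2**p from H(0) = 1 and the block recursion."""
    H = np.array([[1]])
    for _ in range(p):
        H = np.block([[H, H], [H, -H]])
    return H

def sign_changes(M):
    """Number of sign changes along each row of a +1/-1 matrix."""
    return (np.diff(M, axis=1) != 0).sum(axis=1)

def walsh_sequency(p):
    """Rows of H(p) reordered so that row n has exactly n sign changes."""
    H = hadamard(p)
    return H[np.argsort(sign_changes(H))]

print(hadamard(2))          # natural (Hadamard) ordering for N = 4
print(walsh_sequency(3))    # sequency ordering for N = 8, as in Figure 1
```

For $N = 4$ the reordered matrix has the columns of $H(2)$ in the order
0, 2, 3, 1, and for $N = 8$ it reproduces the matrix of Figure 1.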

If $H(2) = [H_0(2), H_1(2), H_2(2), H_3(2)]$, where $H_i(2)$,
$i = 0, 1, 2, 3$, is the $i$th column of $H(2)$, then the corresponding
Walsh-ordered Hadamard matrix is $H_w(2) = [H_0(2), H_2(2), H_3(2), H_1(2)]$.
The procedure of obtaining the Walsh-ordered Hadamard transform from its
definition either requires storage of the Hadamard matrix, or recomputation
whenever the elements of $H_w(p)$ are needed. Hence, either the sample length
is restricted to about $p = 10$ or 20, or the procedure is extremely slow.
There are, however, fast methods which can reduce the number of computations
(additions and subtractions) by a factor of approximately $2^p/p$ from the
number using the definition. The method we discuss here and a computer
subroutine are given by Ahmed and Rao (1975, Chapter 6). The Walsh-Hadamard
matrix can be computed as

$$H_w(p) = \prod_{i=1}^{p} H_i(p) \cdot B \tag{5}$$

where

$$H_i(p) = \begin{bmatrix} F_s & & & & 0 \\ & G_s & & & \\ & & \ddots & & \\ & & & F_s & \\ 0 & & & & G_s \end{bmatrix}, \qquad s = 2^{i-1},$$

with

$$F_s = \begin{bmatrix} I_s & I_s \\ I_s & -I_s \end{bmatrix}, \qquad G_s = \begin{bmatrix} I_s & -I_s \\ I_s & I_s \end{bmatrix},$$

and $I_s$ being the $s \times s$ identity matrix (for $i = p$ the matrix
$H_i(p)$ reduces to the single block $F_s$). The matrix $B$ in (5) is a
matrix which bit-reverses the order of the data. For example, with
$N = 2^3$ the bit-reversal of $1 = (0, 0, 1)$ is $4 = (1, 0, 0)$, and the
bit-reversal of $3 = (0, 1, 1)$ is $6 = (1, 1, 0)$, so that $X(1)$ is
exchanged with $X(4)$ and $X(3)$ is exchanged with $X(6)$ in the data vector.
If $X = (X(0), \ldots, X(N-1))'$ is the data vector ($N = 2^p$), the fast
finite Walsh-Fourier transform is computed as

$$d_N(\lambda_N) = N^{-1/2} H_w(p) X = N^{-1/2} \prod_{i=1}^{p} H_i(p) \cdot B X,$$

where $\lambda_N = (0/N, 1/N, \ldots, (N-1)/N)'$. For example, if $N = 2^3$,
the Walsh-ordered Hadamard matrix can be decomposed as

$$H_w(3) = \begin{bmatrix} F_1 & & & 0 \\ & G_1 & & \\ & & F_1 & \\ 0 & & & G_1 \end{bmatrix} \begin{bmatrix} F_2 & 0 \\ 0 & G_2 \end{bmatrix} F_4 B,$$

where

$$B = \left[\begin{array}{cccc|cccc} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ \hline 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{array}\right].$$

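
The factorization (5) can be checked against the sort-by-sign-changes
construction. The sketch below builds each factor as a dense matrix, which is
convenient for verification but gives up the speed advantage; it assumes the
block pattern alternates $F_s, G_s$ along the diagonal with $s = 2^{i-1}$, as
in the $N = 2^3$ example. All function names are illustrative.

```python
import numpy as np

def bit_reversal_matrix(p):
    """Permutation matrix B: (B x)[r] = x[bit-reversal of r], for N = 2**p."""
    N = 1 << p
    rev = [int(format(i, f"0{p}b")[::-1], 2) for i in range(N)]
    return np.eye(N, dtype=int)[rev]

def stage(i, p):
    """H_i(p): block diagonal with blocks F_s, G_s, F_s, G_s, ..., s = 2**(i-1)."""
    s = 1 << (i - 1)
    I = np.eye(s, dtype=int)
    F = np.block([[I, I], [I, -I]])
    G = np.block([[I, -I], [I, I]])
    N = 1 << p
    H = np.zeros((N, N), dtype=int)
    for b in range(N // (2 * s)):
        lo, hi = 2 * s * b, 2 * s * (b + 1)
        H[lo:hi, lo:hi] = F if b % 2 == 0 else G
    return H

def walsh_by_factorization(p):
    """H_w(p) = H_1(p) H_2(p) ... H_p(p) B, built from right to left."""
    M = bit_reversal_matrix(p)
    for i in range(p, 0, -1):
        M = stage(i, p) @ M
    return M

def walsh_by_sorting(p):
    """Reference construction: Hadamard rows sorted by number of sign changes."""
    H = np.array([[1]])
    for _ in range(p):
        H = np.block([[H, H], [H, -H]])
    return H[np.argsort((np.diff(H, axis=1) != 0).sum(axis=1))]

print((walsh_by_factorization(3) == walsh_by_sorting(3)).all())
```

In a fast implementation each factor would be applied to the data vector as
$N$ additions or subtractions, without forming any matrix.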

The vector of periodogram ordinates is obtained by squaring each element of
$d_N(\lambda_N)$. Let $I_N(m) = d_N^2(m/N)$ denote the $m$th periodogram
ordinate, $m = 0, 1, \ldots, N-1$. It is seen that
$I_N(m) = \sum_{j=0}^{N-1} \hat\tau(j) W(j, m/N)$, where
$\hat\tau(j) = N^{-1} \sum_{k=0}^{N-1} X(k) X(j \oplus k)$. Employing
relationships (3) and (4) we may write

$$\hat\tau(j) = N^{-1} \sum_{m=0}^{N-1} I_N(m) W(m, j/N). \tag{6}$$

Thus, for large $N$, the quickest way to compute $\hat\tau(j)$ is to use the
fast Walsh-Fourier transform twice, once to compute $I_N(m)$ and once to
compute the right-hand side of (6).
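
The two routes to the sample logical covariance can be compared on a short
series. In this sketch the transform is applied as a dense matrix product for
brevity (a fast routine could be substituted), dyadic addition is taken to be
bitwise exclusive-or, and the normalization carries the factor $N^{-1}$ that
makes the inversion exact. Function names are illustrative.

```python
import numpy as np

def walsh_sequency(p):
    """Sequency-ordered Walsh matrix of size 2**p; entry [n, m] is W(n, m/N)."""
    H = np.array([[1]])
    for _ in range(p):
        H = np.block([[H, H], [H, -H]])
    return H[np.argsort((np.diff(H, axis=1) != 0).sum(axis=1))]

def logical_cov_direct(x):
    """tau_hat(j) = N^{-1} sum_k x(k) x(j XOR k), computed from the definition."""
    N = len(x)
    k = np.arange(N)
    return np.array([np.mean(x[k] * x[j ^ k]) for j in range(N)])

def logical_cov_via_transform(x):
    """Same quantity from two Walsh transforms: data -> periodogram -> tau_hat."""
    N = len(x)
    W = walsh_sequency(N.bit_length() - 1)
    I = (W @ x / np.sqrt(N)) ** 2     # periodogram ordinates I_N(m)
    return W @ I / N                  # N^{-1} sum_m I_N(m) W(m, j/N)

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=16).astype(float)   # arbitrary binary sample
x -= x.mean()                                   # mean-corrected data

print(np.max(np.abs(logical_cov_direct(x) - logical_cov_via_transform(x))))
```

The direct route costs on the order of $N^2$ operations, whereas two fast
transforms cost on the order of $N \log_2 N$ each.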

4. SEQUENCY ANALYSIS OF BINARY TIME SERIES 

First, we discuss a possible model for a binary-valued time series where
Walsh-Fourier analysis is desirable as well as superior to trigonometric
analysis. Consider a binary version of the signal-plus-noise models used for
Gaussian processes. In general, write the binary time series
$\{X(n), n = 0, 1, 2, \ldots\}$, $X(n) \in \{0, 1\}$, as

$$X(n) = S(n) + \varepsilon(n), \qquad n = 0, 1, 2, \ldots, \tag{7}$$

where $S(n)$ is a random stationary binary (not necessarily 0 or 1) signal
which possibly depends on parameters $\theta$, and $\varepsilon(n)$ is a
zero-mean binary (not necessarily 0 or 1) white noise process which is
uncorrelated with the signal $S(n)$. Let $X(n)$ satisfy the conditions of
Theorem 2.1.

For a specific example of such a model consider a two state (0 or 1)
Markov process. Let $X(n)$ be the value of the process at time $n$, let
$\theta$ be the probability of being in state 1 at any given time and let
$p_{ij}$, $i, j = 0, 1$, denote the transition probabilities. Then we may
write

$$E[X(n)] = E[X(n-1)] p_{11} + E[1 - X(n-1)] p_{01}$$

from which we obtain the signal-plus-noise model

$$X(n) = X(n-1) p_{11} + (1 - X(n-1)) p_{01} + \varepsilon(n) \tag{8}$$

where $\varepsilon(n)$, $n = 0, 1, 2, \ldots$, is a zero mean Bernoulli type
white noise process whose value depends on the value of $X(n-1)$, but is
uncorrelated with the signal $S(n) = X(n-1) p_{11} + (1 - X(n-1)) p_{01}$. To
see this note that

$$\mathrm{Cov}(\varepsilon(n), X(n-1)) = E\{[X(n) - p_{01} - X(n-1)(p_{11} - p_{01})] X(n-1)\}$$
$$= \theta p_{11} - \theta p_{01} - \theta (p_{11} - p_{01}) = 0.$$

Hence (8) is of the binary signal-plus-noise form given in (7), where the
binary signal $S(n)$ is a function of two parameters $p_{01}$ and $p_{11}$.
For the binary signal-plus-noise models given by (7), Walsh-Fourier
analysis would be useful for detecting whether or not a binary signal exists
in the time series (as opposed to the series being white noise) and, if so,
for determining the cyclic behavior (in terms of sequency) of the binary
signal.

For example, if we denote the autocovariance functions of $S(n)$ and
$\varepsilon(n)$ by $\gamma_S(h)$ and $\gamma_\varepsilon(h)$,
$h = 0, \pm 1, \pm 2, \ldots$, respectively, where
$\gamma_\varepsilon(h) = 0$ for $h \ne 0$, and assume that $\gamma_S(h)$ is
absolutely summable, then the Walsh-Fourier spectral densities of $S(n)$ and
$\varepsilon(n)$ exist. Denote by
$f_S(\lambda) = \sum_{j=0}^{\infty} \tau_S(j) W(j, \lambda)$ and
$f_\varepsilon(\lambda) = \sum_{j=0}^{\infty} \tau_\varepsilon(j) W(j, \lambda)$
the Walsh-Fourier spectra of $S(n)$ and $\varepsilon(n)$, respectively, where
$\tau_S(j)$ and $\tau_\varepsilon(j)$ are the logical covariances of $S(n)$
and $\varepsilon(n)$, respectively. Clearly then, by the model assumptions,
the Walsh-Fourier spectral density of $X(n)$, say $f_X(\lambda)$, is given by
$f_X(\lambda) = f_S(\lambda) + f_\varepsilon(\lambda)$. Now, let
$X(0), \ldots, X(N-1)$ be a sample from the time series (7) and let
$d_N(\lambda_{j(\ell),N})$, $\ell = 1, \ldots, M$, denote the finite
Walsh-Fourier transforms of the sample, where the $\lambda_{j(\ell),N}$ are
as in Corollary 2.1. Then by Corollary 2.1 and Skorokhod's Representation
Theorem (Skorokhod, 1956), the smoothed periodogram has the representation

$$\bar I_N(\lambda) = M^{-1} \sum_{\ell=1}^{M} d_N^2(\lambda_{j(\ell),N}) = M^{-1} [f_S(\lambda) + f_\varepsilon(\lambda)] \chi^2_M \quad \text{a.s.}$$

as $N \to \infty$. If no signal is present, then $f_S(\cdot) = 0$ and the
smoothed periodogram behaves asymptotically like a constant times a $\chi^2$
variate divided by its degrees of freedom. Otherwise, the binary signal
dominates the behavior of $\bar I_N(\lambda)$. We note here that models of
the type given in (7) will extend easily beyond the two state process to time
series which take on values in a discrete finite set.

How well the Walsh-Fourier transform describes continuous valued pro- 
cesses has been explored empirically by various authors (Ahmed and Rao, 
1975; Beauchamp, 1975; Harmuth, 1972; Robinson, 1972, to mention a few). 
How well the Walsh-Fourier transform describes discrete-valued time series, 
in particular binary time series, is explored by Panchalingam (1985). We 
discuss some of the findings here. 

First, suppose that $\{X(n), n = 0, \pm 1, \pm 2, \ldots\}$ is a binary time series
generated by “clipping” or “hard limiting” (Kedem, 1980). That is, let
$\{Z(n), n = 0, \pm 1, \pm 2, \ldots\}$ be an unobservable strictly stationary,
continuous valued time series and put

$$X(n) = \begin{cases} 1 & \text{if } Z(n) \ge u \\ 0 & \text{if } Z(n) < u, \end{cases} \qquad u \text{ fixed.} \tag{9}$$



This is a reasonable model for various situations; for example, a person will 
have an allergic reaction when the pollen level crosses a certain threshold. 
Various binary processes were simulated and analyzed in the sequency do- 
main using the model (9) of Panchalingam (1985). We present some of the 
examples here. In what follows, e(n) is an i.i.d. N(0, 1) sequence. Consider
the following cases: 



$$\begin{aligned} \text{(i)} \quad & Z(n) = 0.9\, Z(n-1) + e(n) \\ \text{(ii)} \quad & Z(n) = 0.25\, Z(n-1) - 0.9\, Z(n-2) + e(n) \\ \text{(iii)} \quad & Z(n) + 0.9\, Z(n-1) = e(n) + 0.25\, e(n-1). \end{aligned} \tag{10}$$
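The clipping construction can be sketched in a few lines of Python. This is only an illustration of case (i) under stated assumptions: the function names `simulate_ar1` and `clip`, the burn-in length, the seed and the choice of assigning the value 1 when Z(n) is at or above the threshold are all hypothetical and are not taken from Panchalingam (1985).

```python
import random

def simulate_ar1(n, phi=0.9, burn_in=200, seed=0):
    """Return n values of Z(n) = phi * Z(n-1) + e(n), e(n) i.i.d. N(0, 1)."""
    rng = random.Random(seed)
    z, out = 0.0, []
    for t in range(n + burn_in):
        z = phi * z + rng.gauss(0.0, 1.0)
        if t >= burn_in:          # discard the start-up transient
            out.append(z)
    return out

def clip(z, u=0.0):
    """Hard-limit a real-valued series at the threshold u (assumed 1/0 coding)."""
    return [1 if value >= u else 0 for value in z]

z = simulate_ar1(256)             # unobservable continuous-valued series
x = clip(z, u=0.0)                # observed binary series
```

Because the coefficient 0.9 gives strong positive dependence, the clipped series tends to consist of long runs of ones and zeros, which is the kind of behaviour that concentrates power at the low sequencies.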

Figure 2 shows the Walsh-Fourier periodogram smoothed as follows over 5
sequencies

$$\hat{f}_N(\lambda_j) = \frac{1}{5} \sum_{k=-2}^{2} d_N^2(\lambda_{j+k}) \tag{11}$$

from a sample of size N = 2^ from the clipped process 10(i) with u = 0 in
(9). Similarly, Figures 3 and 4 show the smoothed periodogram (11) for the
clipped processes 10(ii) and 10(iii), respectively, for samples of size N = 2^.
Each figure is plotted on a ln scale.
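A minimal sketch of a five-point smoothed Walsh periodogram follows. It rests on several assumptions that are not taken from the text: the Walsh functions are obtained by sorting the rows of a Sylvester-type Hadamard matrix by their number of sign changes, the periodogram ordinate is the squared transform with a 1/N normalisation, and the averaging window is simply truncated at the two ends of the sequency axis. All function names are hypothetical.

```python
def sign_changes(row):
    """Sequency of a +/-1 row: the number of sign changes along it."""
    return sum(1 for a, b in zip(row, row[1:]) if a != b)

def walsh_matrix(n):
    """Walsh functions of length n (n must be a power of 2), sorted by sequency."""
    rows = [[1]]
    while len(rows) < n:
        # Sylvester doubling step for a Hadamard matrix
        rows = ([r + r for r in rows] +
                [r + [-v for v in r] for r in rows])
    return sorted(rows, key=sign_changes)

def walsh_periodogram(x):
    """Squared finite Walsh transform, (sum_n x(n) W(n, j))^2 / N, for each j."""
    n = len(x)
    return [sum(w * v for w, v in zip(row, x)) ** 2 / n
            for row in walsh_matrix(n)]

def smooth5(values):
    """Average each ordinate with up to two neighbours on either side."""
    out = []
    for j in range(len(values)):
        window = values[max(0, j - 2): j + 3]
        out.append(sum(window) / len(window))
    return out

x = [1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1]   # toy binary series
smoothed = smooth5(walsh_periodogram(x))
```

The explicit matrix costs of the order of N squared operations, which is adequate for illustration; a fast Walsh transform would be the natural choice for long records.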

Second, we consider the discrete ARMA (DARMA) processes described
by Jacobs and Lewis (1978). Let $\{I(n), n = 0, 1, \ldots\}$, $\{J(n), n = 0, 1, \ldots\}$
and $\{Y(n), n = 0, 1, \ldots\}$ be mutually independent i.i.d. Bernoulli sequences
such that $P\{I(n) = 1\} = p$, $P\{J(n) = 1\} = q$ and $P\{Y(n) = 1\} = \theta$. The
binary DAR(1) model is written (with $X(0) = Y(0)$) as follows:

$$X(n) = I(n) X(n-1) + (1 - I(n)) Y(n), \qquad n = 1, 2, \ldots. \tag{12}$$

Note that $E[X(n)] = \theta$ and $\mathrm{Corr}(X(n), X(n+h)) = p^h$, $h = 0, 1, 2, \ldots$.
Figure 5 shows the smoothed Walsh-Fourier periodogram (11) plotted on ln
scale for N = 2^ observations from (12) with p = 0.75 and $\theta$ = 0.50.
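The DAR(1) recursion is easy to simulate, and the stated correlation structure can be checked empirically. The sketch below is illustrative only: it writes theta for P{Y(n) = 1}, and the names `simulate_dar1` and `autocorr`, the sample size and the seed are assumptions.

```python
import random

def simulate_dar1(n, p, theta, seed=0):
    """Simulate X(n) = I(n) X(n-1) + (1 - I(n)) Y(n) with X(0) = Y(0)."""
    rng = random.Random(seed)
    bern = lambda prob: 1 if rng.random() < prob else 0
    x = [bern(theta)]                      # X(0) = Y(0)
    for _ in range(1, n):
        i, y = bern(p), bern(theta)
        x.append(i * x[-1] + (1 - i) * y)  # keep the old value or draw a new one
    return x

def autocorr(x, h):
    """Sample autocorrelation of x at lag h."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    cov = sum((x[t] - mean) * (x[t + h] - mean) for t in range(n - h)) / n
    return cov / var

x = simulate_dar1(20000, p=0.75, theta=0.50, seed=1)
r1, r2 = autocorr(x, 1), autocorr(x, 2)    # should be near 0.75 and 0.75 ** 2
```

With p = 0.75 the process repeats its previous value three times out of four on average, which produces the long runs that show up as low-sequency power.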

Next we consider the DARMA(1,1) model, which is written as follows:

$$X(n) = I(n) X(n-1) + (1 - I(n)) B(n), \tag{13}$$

where

$$B(n) = J(n) Y(n) + (1 - J(n)) Y(n-1).$$

In this case $E[X(n)] = \theta$ and
$\mathrm{Corr}(X(n), X(n+h)) = p^{h-1} \{p + (1-p)^2 q (1-q)\}$, $h = 1, 2, \ldots$.
Figure 6 shows the smoothed Walsh-Fourier periodogram
(11) plotted on ln scale for N = 2^ observations from (13) with p = 0.90,
q = 0.25, and $\theta$ = 0.50.
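The lag-one correlation of the DARMA(1,1) model can be checked directly from the recursion. The short derivation below is a sketch that writes $\theta$ for $P\{Y(n) = 1\}$ and $v = \theta(1-\theta)$ for the common variance of $X(n)$ and $Y(n)$, and uses only the mutual independence of the three Bernoulli sequences.

```latex
\begin{aligned}
\operatorname{Cov}(X(n+1), X(n)) &= p\,v + (1-p)\operatorname{Cov}(B(n+1), X(n)), \\
\operatorname{Cov}(B(n+1), X(n)) &= (1-q)\operatorname{Cov}(Y(n), X(n)), \\
\operatorname{Cov}(Y(n), X(n))   &= (1-p)\operatorname{Cov}(Y(n), B(n)) = (1-p)\,q\,v, \\
\operatorname{Corr}(X(n), X(n+1)) &= p + (1-p)^2\, q\,(1-q).
\end{aligned}
```

For lags $h \ge 2$ the term $B(n+h)$ involves only $Y(n+h)$ and $Y(n+h-1)$, both independent of $X(n)$, so each further lag multiplies the correlation by $p$. With p = 0.90 and q = 0.25 the lag-one value is $0.90 + 0.01 \times 0.1875 \approx 0.902$.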

As a real data example, we compare the state of the diastolic blood 
pressure (DBP), classified either “high” or “normal”, of a mild hypertensive 
being treated by diet over two periods of time. The first period is 2 months 
of data (2 observations per day) after an initial 2 months on the diet. The 
second period is 2 months of data (2 observations per day) after 4 months on 
the diet. Figure 7 shows the periodogram for the data from the first period, 
and Figure 8 shows the periodogram for the second period. In each case, 
the periodogram has been smoothed by a simple average over 5 sequencies
as given in (11). From Figure 7 we see that there is virtually no signal,
that is, there is extremely low power and the periodogram is relatively flat.
However, in Figure 8 we note that there is power at the lower sequencies 
(that is, the data is predominantly long runs of “highs” and “normals”). We 
note the similarity between Figure 8 and the clipped AR(1) periodogram 
given in Figure 2. One could conjecture that there is a Markov signal and 
although the patient is on treatment, whether or not his DBP is “high” or 
“normal” is conditionally dependent on the state of the most recent past. 

Figure 4. Clipped ARMA(1,1)

Figure 5. DAR(1)

Figure 6. DARMA(1,1)

5. SUMMARY AND CONCLUSIONS

It has been known, primarily in the engineering disciplines, that Walsh
spectral analysis is superior to Fourier spectral analysis for non-sinusoidal
time series. This is primarily due to the empirical analyses performed by
Ahmed and Rao (1975), Beauchamp (1975), Harmuth (1972), and Robinson
(1972). As we have demonstrated in Section 4, Walsh-Fourier analysis is also
informative in the analysis of binary time series and may also be superior
in analyzing discrete-valued and non-Gaussian continuous-valued processes.
As previously mentioned, very little work has been done on the real time
statistical theory of the Walsh-Fourier analysis of discrete-valued time series
and hence there are many avenues open for development.

In particular, one may develop models and corresponding analyses for 
discrete-valued time series along the lines of the binary signal-plus-noise 
model given in Section 4. Also, we may consider analysis of power for
discrete-valued time series where one observes $Z_q(n)$, $q = 1, \ldots, Q$,
$n = 0, 1, \ldots, N - 1$, which are Q independent repeated observations on a
discrete-valued signal-plus-noise process. One particular hypothesis of interest is
whether or not the Q series have a common signal. Such analyses may be
carried out in the sequency domain along the lines of the analysis of power
for Gaussian time series described by Brillinger (1975, Section 7.9). In such
cases a test of $\max_{0 < \lambda < 1} f(\lambda)$ would be of interest.

We have also seen models in which the signal and the noise are
correlated, such as the DARMA models given by Jacobs and Lewis (1978), for
which a Walsh-Fourier analysis similar to the uncorrelated case could be
developed. The feasibility and practicality of models and filters for
discrete-valued stationary time series where spectral analysis in the sequency
domain is of primary interest must be explored further. Moreover, appropriate
statistical theory such as applicable central limit theorems and ergodic results
must be developed. At present, most of the work in this area is empirical.

EXCITATION OF GEOPHYSICAL SYSTEMS
WITH FRACTAL FLICKER NOISE

ABSTRACT 

Geophysicists often model their measurements, derived from natural pro- 
cesses, as the linear superposition of a simple rational system function and 
a purely random excitation process. For many geophysical processes, the 
assumption of linearity for its deterministic component is sufficient but the 
assumption of a purely random excitation often and easily leads to a misiden- 
tification of the system function. Many geophysical systems are excited by 
stochastic processes which appear to be stationary even on geological time 
scales but which possess a preponderance of long period components.
Self-similar, fractal stochastic processes form a class of possible geophysical
excitations having a “power spectrum” of the form $1/|f|^k$. Of this class,
flicker-noise processes, for which k = 1, exist on the boundary between the
stationary and evolutionary subsets. No fractal stationary random
excitation can provide for greater weighting of long period components.

The Chandler wobble of the earth’s rotation axis can be essentially
described as a single-pole linear system. The multitude of natural forces which
contribute to its excitation combine as a stochastic process which is heavily
weighted in long periods. Because of its basic importance in astronomy,
navigation, time-keeping, etc., the wobble has been carefully measured since the
turn of this century. Recent advances in geodetic and astrometric
technology have provided a reliable, homogeneous data set which can be directly
decomposed into a linear, deterministic wobble function and stochastic
excitation. The use of the flicker-noise excitation model allows for the direct
identification of the theoretically-simple, single-pole resonance. A subset of
the pole position record, obtained from the Bureau International de l’Heure
in Paris, is analysed in terms of a modified autoregressive data model
comprising an all-pole system function excited by a minimum-power, flicker-noise
process.



1. INTRODUCTION 

Linear data models are often used by geophysicists in the description 
and analysis of measurements which are regarded as being derived from 
the excitation of a deterministic system by some stochastic process. The 
excitation of the system represents some generally unknown but natural 
geological or geophysical variation. The system includes the instrumentation 
used in the observation and the current geophysical theory of the phenomena 
involved. Analysis and subsequent interpretation of the measurements allow
the discovery of essential properties of the system and its excitation, that 
is of the characteristic geological condition and the manifest geophysical 
phenomena. Applying a linear data model, the geophysicist recognizes the 
excitation as a stochastic process corresponding to the model innovation; 
the system, which is usually but not always linear and rational, contains
certain undetermined parameters and corresponds to the operator function 
of the data model. 

Continuing technological developments improve the quality of measure- 
ments while the geophysical theory becomes an ever more detailed and com- 
plete description of natural phenomena. As a consequence, we are forced 
to elaborate the statistical models of the excitations of the systems. In this 
paper, we propose to argue, by example, in support of a particular statistical 
excitation model which we believe to be appropriate for the description of 
measurements derived from a wide class of geophysical phenomena.

2. STOCHASTIC MODELS OF GEOPHYSICAL DATA 

White-Gaussian or filtered white-Gaussian processes have long been used 
directly in (or implied by the standard methods of) spectral analysis and sig- 
nal decomposition employed in traditional geophysical analysis. The autore- 
gressive (AR), moving-average (MA) and autoregressive-integrated-moving- 
average (ARIMA) linear data models are commonly used in geophysical data 
modelling. These models are based upon the assumption that the signal or
measurement derives from a linear operation on uncorrelated Gaussian noise. 
In geophysics, and particularly in the analysis of seismic reflection data, the 
AR data model has been found to be most useful. The predictive decon- 
volution (then called decomposition) method for the analysis of the seismic 
reflection records was first suggested by Wadsworth et al. (1953).
Robinson’s now classical “MIT GAG” report and Ph.D. thesis (Robinson 1954;
republished 1967) formalized the method and properly recognized the roots 
of the technique in the work of Yule (1927). Common predictive deconvo- 
lution as practised in geophysical analysis is essentially a variation on the 
Box-Jenkins (1970) forecasting theory for autoregressive time series. Within
the geophysical community, this AR-model-based deconvolution process has 
been much developed and elaborated to account for evolutionary (Clarke, 
1968) and multi-channel systems (Burg, 1964; Davies and Mercado, 1968; 
Treitel, 1970). Much work has been devoted to the development of efficient 
and accurate algorithms for the determination of the appropriate AR model 
from vast seismic data sets (e.g. Burg, 1967, 1975; Wiggins and Robinson,
1965; Ulrych and Clayton, 1976; Barrodale and Erickson, 1980; Tyraskis 
and Jensen, 1985). 

In the analysis and decomposition of seismic reflection data, the compo- 
nents of the AR model correspond to real physical elements. The models are 
“structural” (Akaike, 1985): the model innovation corresponds to the exci- 
tation of the linear geophysical system by a subjectively random geological 
condition. Because the geophysical system (comprising the instrumentation 
used in the seismic surveying technique, the source of seismic wave energy 
and the theoretically deterministic part of the seismic wave propagation 
phenomenon) is essentially resonant, a purely autoregressive model filter op- 
erator is most appropriate for its description. Little advantage has been 
found in using linear data models with a moving-average component. 

Most recent geophysical interest in the seismic linear data modelling 
problem has involved the non-Gaussian and/or self-correlated properties of 
the structural innovation function. Wiggins (1978) introduced the so-called 
minimum-entropy deconvolution method which obtains that linear filter
operator which, while consistent with the seismic data, maximizes the
kurtosis of the structural innovation. Postic et al. (1980) have elaborated this
method to allow for maximization of arbitrary fractional-order moments of 
the probability density function of the innovation. These methods are recog- 
nized to be clearly superior to the classical predictive deconvolution method 
which uses the minimum-variance (or least-squares) of the innovation as the
solution criterion. Hosken (1980) showed that a stochastic model of the 
seismic reflectivity sequence, equivalent to the structural innovation in the 
modelling problem, as derived empirically from geological logs from several 
major petroleum-bearing sedimentary basins, is strongly leptokurtic. This
discovery justifies the current preference for maximum kurtosis, rather than 
minimum variance, as the criterion of choice in sophisticated seismic de- 
convolution. Vafidis (1984) and Jensen and Vafidis (1986) have extended 
these concepts to allow for extremal skewness and kurtosis as criteria in the 
solution of a more general class of inverse problems.

Hosken (1980) also showed that the acoustic impedance as a function of
depth in a sedimentary geological sequence does not show a white spectral 
character and is therefore self-correlated. In particular, he showed that the 
velocity-depth function, which determines the seismic reflectivity sequence, 
has the nearly $1/|f|$ spectral characteristic of flicker noise, where f is the
frequency. This accounts directly for the well-known fact that a seismic
reflectivity sequence is deficient in low-frequency power density in comparison
to an uncorrelated sequence. The prior assumption that the seismic
reflectivity sequence is uncorrelated and Gaussian, that is, the assumption of a
purely random structural innovation in the data model, is not justifiable on
empirical grounds.

Non-white and/or non-Gaussian stochastic models are important in
many areas of geophysical analysis apart from seismic reflection
deconvolution. Indeed, Mandelbrot (1983) has discussed the ubiquity of
“fractal” stochastic processes of the nearly $1/|f|$ type (i.e. flicker noise) in
many natural and geophysical phenomena. Jensen and Mansinha (1984) 
have shown that the Earth’s rotation pole-path is well modelled as the exci- 
tation of the Chandler resonance (Munk and MacDonald, 1960) by a fractal 
flicker-noise process. Also, Jensen (1982, unpublished) has shown that the 
basic geological excitation of airborne electromagnetic prospecting systems is 
best modelled as a flicker-noise process. One might reasonably expect that
general “geophysical landscapes” of topography, reduced gravity anomaly, 
electrical resistivity variations, magnetic anomaly or susceptibility, for ex- 
ample, can be best described as fractal flicker noise. 

Eventually, we must develop linear data modelling methods which simul- 
taneously allow for self-correlated and non-Gaussian structural innovations. 
Gauss (1839) solved the first geophysical inverse problem, modelling the 
worldwide observations of the geomagnetic field as a finite order expansion 
in terms of associated Legendre functions. Employing the least-squares cri- 
terion and assuming spatially uncorrelated differences between his model 
and data set, he proved that the Earth’s magnetic field is of internal ori- 
gin. While Gauss made no claim that the model structural differences were 
necessarily uncorrelated and of minimum variance, he was forced to choose 
such criteria for computational convenience. Geophysicists are not today so 
technologically disadvantaged; given the power and precision of contempo- 
rary computing machines, we need not fall back to Gauss’ choice of criteria. 
We can accommodate actual (geo)physical knowledge of the structural
innovation in our data modelling procedures. We must now develop analytical
methods which allow for non-Gaussian and self-correlated stochastic
composition.

3. FRACTAL STOCHASTIC PROCESSES

In numerous papers and articles culminating in his book, The Fractal 
Geometry of Nature, Mandelbrot (1983) developed the concept of fractal 
(fractional dimensional) curves and surfaces. He defines a fractal
as a set for which the Hausdorff-Besicovitch dimension strictly exceeds the
topological dimension. A most important feature of the fractal geometry is
its self-similarity: each piece of the geometry is similar to its whole except for
scale. Fractal curves and surfaces may be either deterministic, in the sense
that they may be generated by a regular rule, or stochastic, having
inherent randomness. A subset of the stochastic fractal geometry comprises
the spectrally “scaling noises” which are characterized most simply by their
$1/f^k$, $0 \le k \le 2$, power density spectrum. Mandelbrot states:

“Many scaling noises have remarkable implications in their fields, 
and their ubiquitous nature is a remarkable generic fact.” 

One can demonstrate the fractal self-similarity of a time series of a frequency 
band-limited scaling noise as follows. 

1. Low-pass filter the sequence to reduce its bandwidth from $-f_0 < f < f_0$
to $-f_0' < f < f_0'$,

2. rescale time: $t' = t f_0'/f_0$,

3. rescale amplitude: $d' = d\,(f_0/f_0')^{(1-k)/2}$.

The form of the sequence is thus preserved. Within this class of noises, flicker
noise, for which k = 1, is unique in that it represents that scaling noise which
possesses the largest value of k (consequently the greatest weight of low
frequencies) while remaining properly stationary. For k > 1, the variance from
the mean (or initial state) must increase with time and, therefore strictly, no
power spectrum exists. However, the energy density spectrum of a sample
of such noise does describe the $1/f^k$ form over the range of frequencies that
may be estimated given the length of sample and its sampling increment.
Uncorrelated or white Gaussian noise is spectrally scaling with k = 0. A
random walk in amplitude with Gaussian steps is spectrally scaling with
k = 2. Figure 2 shows the form of white noise, flicker noise and “brown”
noise (the random walk with Gaussian steps) obtained from pseudo-random
generators of these noises which were developed by one of the authors (O.
G. J.).
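One simple way to produce sample sequences of such scaling noises is to pass white Gaussian noise through a fractional summation filter. The sketch below is not the authors' generator; it assumes the filter $(1 - B)^{-k/2}$, whose power response behaves like $1/f^k$ at low frequencies, truncated to the record length and started from rest. The names `scaling_filter` and `scaling_noise` are hypothetical.

```python
import random

def scaling_filter(k, length):
    """Impulse response h_j of the fractional summation filter (1 - B)^(-k/2)."""
    h = [1.0]
    for j in range(1, length):
        h.append(h[-1] * (j - 1 + k / 2.0) / j)
    return h

def scaling_noise(k, n, seed=0):
    """Filter white Gaussian noise to obtain an approximate 1/f^k sequence."""
    rng = random.Random(seed)
    white = [rng.gauss(0.0, 1.0) for _ in range(n)]
    h = scaling_filter(k, n)
    # causal convolution, started from rest
    return [sum(h[j] * white[t - j] for j in range(t + 1)) for t in range(n)]

white = scaling_noise(0, 512)     # k = 0: the filter is a unit impulse
flicker = scaling_noise(1, 512)   # k = 1: slowly decaying memory
brown = scaling_noise(2, 512)     # k = 2: running sum, i.e. a random walk
```

The two limiting cases give a useful check: for k = 0 the filter reduces to a unit impulse, and for k = 2 all its coefficients equal one, so the output is the running sum of the white sequence. For k = 1 the coefficients decay only like the inverse square root of the lag, which is the long memory responsible for the preponderance of long-period components.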

Geophysicists are, or should be, attracted to the flicker noise description
of the randomness in the phenomena which they observe for several reasons.
Geophysicists apply physics as their tool for the description of the
conditions and phenomena of the Earth, planets, satellites and the solar-system
environment. Properly covariant physics must apply everywhere in the
universe and for all time. More locally and specifically, many of the geological
phenomena geophysicists attempt to describe by their theories are slowly
evolutionary. We often presume that, during any time of observation which is
short compared with geological time scales, the geophysical manifestations
are essentially stationary. We further presume that our geophysical theories
can equally well apply to any appropriate geological subset; that is, they
should apply just as well in Africa as they do in Canada. Then, recognizing
the colloquial fact that most geophysical observations in time or space show
a strong preponderance of low-frequency composition which cannot often be
accounted for by basic geophysical theory, we are led to the choice of flicker
noise as the preferred form of excitation or structural innovation in data
modelling procedures.

O. G. JENSEN AND L. MANSINHA

Figure 1. A linear system model representing the formation of geophysical
measurements. Note the time-series equivalent terms shown bracketed.



Figure 2. Spectrally self-scaling time series: (a) white Gaussian noise
(spectrum $1/f^0$), (b) Gaussian flicker noise (spectrum $1/f^1$) and (c) brown noise
or a random walk with Gaussian steps (spectrum $1/f^2$). The process mean
(starting value for the random walk) is indicated by the mid-line; the
process standard deviation, positive and negative from the mean, is shown as the
upper and lower lines. The random walk is non-stationary. For the flicker
noise sequence (b), the sample mean does not necessarily closely correspond
to the process mean.

4. THE DECONVOLUTION PROBLEM

The response of a linear geophysical system excited by a stochastic
geophysical or geological variation (Figure 1) is determined as the linear
superposition of the excitation and system functions:

$$r(t) = \int_{-\infty}^{\infty} e(s)\, h(t - s)\, ds = e(t) * h(t), \tag{1}$$

where r(t) is the response, e(t) the excitation and h(t) the system function.
The symbol * is used to represent the convolution or superposition integral
form above. The stochastic excitation is assumed to correspond to some
appropriate statistical model while being extreme in some sense. In the
most simple deconvolution theory, the excitation is assumed to be purely
random (frequency bandwidth-limited white-Gaussian noise) with minimum
variance. The full complexity of the geophysical system function, h(t), is
generally assumed to be unknown. We, however, desire its inverse, $h^{-1}(t)$,
so that under convolution of this function with the observed and recorded
response, r(t), we may determine the excitation, e(t), as follows:

$$e(t) = r(t) * h^{-1}(t)$$

since

$$h^{-1}(t) * h(t) = \delta(t),$$

where $\delta(t)$ is the Dirac impulse function. Nature and geophysical theory
constrain the properties of h(t). Often, we expect this function to be properly
causal, i.e.,

$$h(t) = 0, \quad t < 0;$$

stable, i.e.,

$$\int_0^{\infty} h^2(t)\, dt \quad \text{finite},$$

so that energy be conserved, and usually but not always, we may expect 
that the function is of minimum-delay characteristic. This latter condition, 
also called the minimum-phase condition, is apparently appropriate for all 
completely described and passive temporal geophysical systems (Ulrych and 
Lasserre, 1966). It is essentially a variation of Fermat’s principle which holds 
that any physical signal follows the shortest or longest possible path. In this 
case, the power of the excitation is most quickly transferred through the 
system to provide its response. The causality and minimum-delay condi- 
tions do not necessarily hold for all geophysical deconvolution problems; in 
particular, because space, unlike time, does not evolve in a single direction, 
deconvolution problems involving geophysical space series (one or more spa- 
tial variables substituting for the time dependence in a time-series analogy) 
cannot presume these properties. Here, we shall restrict our attention to the 
simpler, proper time-series deconvolution problem for which we have h(t)
causal and of minimum delay. This, then, allows that $h^{-1}(t)$ is also causal
and of minimum delay; the system is invertible.

We are required to find that causal, stable, minimum-delay inverse
function, $h^{-1}(t)$, from our observational record such that the excitation
corresponds to the appropriate prior-assumed statistical model. In the
classical predictive deconvolution problem (Robinson, 1954), the excitation is
assumed to be purely random and of minimum variance.
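A discrete toy version of the deconvolution relations may help fix ideas. In the sketch below the system function is assumed to be the single decaying exponential h_m = a^m with |a| < 1, which is causal, stable and of minimum delay; its causal inverse is then the two-term operator (1, -a). The value a = 0.8, the record length and the function name `convolve` are illustrative choices only, and here the system is known in advance, whereas in practice the inverse must be estimated from the record.

```python
import random

def convolve(a, b):
    """Full discrete convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

n = 50
a = 0.8
h = [a ** m for m in range(n)]           # causal, stable, minimum-delay system
h_inv = [1.0, -a]                        # its causal inverse operator

rng = random.Random(0)
e = [rng.gauss(0.0, 1.0) for _ in range(n)]   # purely random excitation
r = convolve(e, h)[:n]                        # observed response
e_rec = convolve(r, h_inv)[:n]                # deconvolved excitation
```

Convolving `h` with `h_inv` gives a unit impulse over the first n samples, the discrete counterpart of the Dirac impulse relation, and `e_rec` agrees with `e` up to rounding error.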



5. POLAR MOTION 

The rotation axis of the earth is not fixed to the earth but has periodic and
secular motions within the earth. On the other hand, an observer in space 
would notice that the rotation axis is more or less stationary in space and 
it is the earth that has slow motions (in addition to the spin) about the 
rotation axis. The change in orientation of the earth in space also implies 
that, at any given point on the earth, all the stars would be displaced by 
identical amounts. Since the latitude of a place is measured by observing 
reference stars, the measured latitude at any given location will also reflect 
the apparent stellar displacement. The change in longitude appears as an 
error in the spin rate of the earth, which is often expressed as a change 
in the length-of-day (l.o.d.). The terms “variation of latitude”, “polar mo- 
tion” and “wobble” are used interchangeably to describe the same physical 
phenomenon (see Munk and MacDonald, 1960; Lambeck, 1980). 

The wobble has secular (i.e., long period), seasonal, annual, and
14-monthly components. Other minor components are also present. The
14-month period was first detected by S. C. Chandler and is usually referred
to as the “Chandler wobble”. On the basis of the theory of rotating rigid
bodies, Leonhard Euler predicted a 10-month free wobble for the earth in 
the eighteenth century. The lengthening of the period to 14 months is due 
to elastic yielding of the earth, as well as the presence of the ocean and the 
fluid outer core. Occasionally, the Chandler wobble is referred to as the “free 
Eulerian nutation”.

The Chandler wobble arises whenever the axis of rotation is not coinci- 
dent with the polar axis of figure which is the axis of maximum moment of 
inertia. The pole then executes a slowly decaying spiral around the axis of 
figure with a 14-month periodicity until the two axes are coincident. One 
can view this motion as that of a damped harmonic oscillator. Thus, regard- 
less of the value of the “Q” of the oscillator, the rotation axis and the figure 
axis should have become coincident over geological time. At present,
neither the damping mechanism nor the excitation source has been definitely
identified. Speculations abound. Geophysical interest in the phenomenon is
spurred by the hope that identification of the two mechanisms will provide 
insight into the structure and properties of the earth. 

There is also a more utilitarian interest. Variation of the latitude and 
longitude affects the geographical reference frames and the measurements of 
time. Thus interest in the wobble has been high among geodesists and
timekeepers. In 1899, the International Latitude Service (ILS) was established.
Since 1900, the ILS has been observing the position of the rotation pole. 
Originally the data was obtained from five ILS stations near the 39° N par- 
allel. In 1962, the ILS was reorganized into the International Polar Motion 
Service (IPMS). The IPMS continues the work of the ILS, but now includes 
data from other observatories. 

In 1955, the Bureau International de l’Heure (BIH) in Paris was entrusted
with the task of determining and predicting the path of the rotation
pole to aid in timekeeping. The BIH uses time and pole position data from 
all available sources (see Mueller, 1969). In addition to data from optical 
instruments, the BIH has been using pole positions determined from
space-geodetic and radio-interferometric methods.

Although the physics of the wobble appears simple enough, the mea- 
surements present problems of extraordinary complexity. An earth “fixed” 
coordinate system is not immune to slow and undetectable drift. Possible 
causes are shifts in the local vertical and physical motion of the observato- 
ries due to tectonic processes. The mean pole has shifted from its position 
around 1900. In 1962, the International Union of Geodesy and Geophysics 
defined the mean pole position during the epoch 1900 to 1905 as the
Conventional International Origin (CIO). The location of the CIO is fixed to
the earth only as well as the five original ILS stations are fixed. While
the IPMS-ILS record has remained essentially homogeneous since its begin- 
ning, it has not been able to accommodate the remarkable improvements 
in the accuracy of measurements which have been achieved through con- 
temporary technology and applied in the compilation of the inhomogeneous 
BIH records. The advent of satellite geodetic and navigation systems and of 
radio-astronomical interferometry methods of measuring the pole path has 
resulted in a current standard error of measurement of less than 10 cm in 
each of the two orthogonal coordinates (along 0° and 90° E longitude, origin 
CIO) for the pole position compiled at 5-day intervals by the BIH. Unfor- 
tunately, this BIH record of “raw” values (no averaging or smoothing of the 
computed pole position) is not homogeneous for a long period because of 
the recent incorporation of contemporary technology. As in any time series 
problem, we would prefer to use the longest and cleanest record available 
for our analyses. Unfortunately, geophysical data sets of most recent epoch 
and, consequently, of short duration are most free of error. We are forced to 
select between the long noisy record and the short clean one. Through 1982, 
the maximum amplitude of the pole position offset from the CIO has been 
about 10.5 m (0°) and 17.5 m (90°E) and consequently, the standard errors 
of measurement correspond to about 1 part in 100-200. The BIH pole path 
record of raw offsets at 5-day sampling intervals which we use in this article 
remained approximately homogeneous from the beginning of 1978 through 
to the end of 1982. 
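The quoted relative precision follows directly from the figures given above, taking the upper bound of 10 cm for the standard error in each coordinate:

```latex
\frac{0.10\ \text{m}}{10.5\ \text{m}} \approx \frac{1}{105},
\qquad
\frac{0.10\ \text{m}}{17.5\ \text{m}} = \frac{1}{175}.
```

Since 10 cm is stated as an upper bound on the standard error, these ratios are themselves upper bounds on the relative error in the two coordinates.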

The pole-path record as compiled by the BIH comprises two major os- 
cillations with centres offset from the CIO: the Chandler wobble and the 
annual wobble. The annual wobble is a forced oscillation of the body of the 
earth caused by regular meteorological variations in the atmospheric and 
hydrological mass balance. Longer period climatic trends and cycles cause
a low-level amplitude modulation of the annual wobble’s period of 365.2422 
(solar) days. It forms a slightly elliptical path component which beats with 
the damped Chandler resonance. The Chandler resonance, which has a 
period of about 420 days, is continuously excited to an average amplitude 
which is similar to that of the annual wobble. Conventionally, we reduce the 
annual component by direct subtraction of that elliptical path which best 
correlates with the sample-mean-reduced pole-path record; essentially, this 
obtains the best fit of the annual wobble in a least-squares sense. This mean- 
and annual-reduced pole-path record can then be represented by a linear sys- 
tems model as the convolution of the damped Chandler resonance transient 
with the “excitation pole function”. That is, in the absence of any measurement
error, the path as a function of time can be described by equation (1) in the
form



$$z(t) = c(t) * p(t),$$

where z(t) is the path of annual-reduced pole positions, p(t) is a stochastic
excitation function and c(t) is the damped Chandler resonance transient. We
use a right-handed coordinate system describing the pole path as a
complex-valued time function with

$$z(t) = x(t) + i\,y(t),$$

where x(t) is the displacement of the pole from the C.I.O. along the
Greenwich meridian (0°) and y(t) is the displacement along the 90°E meridian.
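As a rough check on the beating between the annual and Chandler components mentioned earlier, the beat period implied by the two nominal periods quoted in the text (365.2422 days and about 420 days) can be worked out as follows. The 420-day figure is only approximate, so the result is indicative rather than precise.

```latex
T_{\text{beat}} = \left( \frac{1}{365.2422} - \frac{1}{420} \right)^{-1}
               = \frac{365.2422 \times 420}{420 - 365.2422}
               \approx 2801\ \text{days} \approx 7.7\ \text{years}.
```

Because the two periods are close, the result is sensitive to the Chandler period assumed. In any case a five-year record spans less than one such beat cycle, which is one reason the annual component is removed before the resonance is modelled.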

The Chandler resonance function, $c(t)$, must be causal and stable and 
should have the minimum-delay or phase characteristic. Therefore, there 
exists a causal, stable, minimum-phase inverse, $\gamma(t)$, such that 

$$\gamma(t) * c(t) = \delta(t)$$

and we can deconvolve $z(t)$ to obtain the excitation pole function (Smylie et 
al., 1970). If now $\gamma(t)$ is normalized such that 

$$b(t) = \delta(t) - \gamma(t)$$

with 

$$b(0) = 0,$$

then we may describe a continuous analog to an autoregressive linear data 
model of the pole path: 

$$z(t) = b(t) * z(t) + p(t).$$



Sampled without aliasing with interval $\Delta t$, this continuous analog reduces 
to the discrete, infinite-order autoregressive model approximation: 

$$z_n = b_n * z_n + p_n \tag{2a}$$

or alternately 

$$\gamma_n * z_n = p_n, \tag{2b}$$

where the symbol $*$ represents the discrete convolution operation. $b_m$, 
$m = 1, 2, \ldots$ is the infinite-order autoregressive, one-step forecasting operator 
which is related to the autoregression or deconvolution operator as 

$$\gamma_m = \delta_m - b_m, \qquad m = 0, 1, \ldots, \infty,$$

where $\delta_m$ is the Kronecker delta operator. In a notation more commonly used 
by statisticians in time series analysis, equations (2a,b) can be represented 
as 

$$\phi(B) z_n = p_n \tag{2c}$$

where 

$$\phi(B) = 1 - \sum_{i=1}^{\infty} b_i B^i$$

and where $B$ is the one-step backshift operator (Anderson, 1976). For a 
finite-length sequence of observations 

$$z_n, \qquad n = 0, 1, \ldots, N$$

and a finite-order autoregressive forecasting operator, 

$$b_m, \qquad m = 1, 2, \ldots, M,$$

equations (2a-c) reduce to a system of linear equations 

$$z_n = \sum_{m=1}^{M} b_m z_{n-m} + p_n \tag{3a}$$

which may be rewritten in vector-matrix form 

$$\mathbf{z} = \mathbf{Z} \cdot \mathbf{b} + \mathbf{p}. \tag{3b}$$

Properly, the Chandler resonance can be described by an autoregressive 
forecasting operator of order $M = 1$ where 

$$b_1 = e^{i\omega\,\Delta t}, \qquad b_0 = 0, \qquad b_m = 0;\ m > 1$$

and 

$$\omega = \omega_0 + i/\tau,$$

where further, 

$$P = 2\pi/\omega_0$$

is the period of the Chandler resonance and $\tau$ its damping time constant. 
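
A small numerical illustration of the first-order operator may help here. The Python sketch below assumes that the sampled resonance transient is $c_n = e^{i\omega n \Delta t}$ with the complex frequency $\omega = \omega_0 + i/\tau$, so that the single coefficient is $b_1 = e^{i\omega \Delta t}$. The function names and the values of 420 days, 4000 days and 5 days are illustrative choices only, not figures taken from the analysis.

```python
import cmath
import math


def chandler_ar1_coefficient(period_days, tau_days, dt_days):
    """Assumed form b_1 = exp(i*omega*dt), with omega = omega_0 + i/tau."""
    omega = complex(2.0 * math.pi / period_days, 1.0 / tau_days)
    return cmath.exp(1j * omega * dt_days)


def period_and_tau(b1, dt_days):
    """Invert b_1 back to the period P and the damping time constant tau."""
    omega0 = cmath.phase(b1) / dt_days      # radians per day
    tau = -dt_days / math.log(abs(b1))      # days
    return 2.0 * math.pi / omega0, tau


def free_decay(b1, n_steps, z0=1.0 + 0.0j):
    """Iterate z_n = b_1 * z_(n-1) with no excitation (p_n = 0)."""
    path = [z0]
    for _ in range(n_steps):
        path.append(b1 * path[-1])
    return path


# Illustrative values (hypothetical): P = 420 d, tau = 4000 d, 5-day sampling.
b1 = chandler_ar1_coefficient(period_days=420.0, tau_days=4000.0, dt_days=5.0)
P, tau = period_and_tau(b1, dt_days=5.0)
path = free_decay(b1, n_steps=84)           # 84 steps of 5 days = one period
print(P, tau, abs(path[-1]))
```

After one full period the unexcited pole returns to its starting direction with its amplitude reduced by the factor $e^{-P/\tau}$, which is the free decay that the damping time constant describes.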

Geophysical theory provides that the actual excitation of the wobble will 
be that which possesses the minimum power whatever its self-correlation 
characteristics may be. Classical least-squares inversion for the $M$-order 
autoregressive forecasting operator obtains that estimate, $\hat{\mathbf{b}}$ (with elements 
$\hat{b}_m$, $m = 1, 2, \ldots, M$) of the vector, $\mathbf{b}$, which is consistent with a 
minimum-variance or minimum-power excitation. That is, 

$$\hat{\mathbf{b}} = (\mathbf{Z}^{\dagger} \cdot \mathbf{P}_N^{-1} \cdot \mathbf{Z})^{-1} \cdot (\mathbf{Z}^{\dagger} \cdot \mathbf{P}_N^{-1}) \cdot \mathbf{z}, \tag{4a}$$

where the symbol $\dagger$ indicates the complex-conjugate transposition of the 
appropriate vector or matrix. The autocovariance matrix of the assumed 
zero-mean, stationary excitation having variance $\sigma_p^2 = E[p_i \cdot p_i^*]$ is 

$$\mathbf{P} = \sigma_p^2\, \mathbf{P}_N \tag{4b}$$

with elements 

$$P_{ij} = E[p_i \cdot p_j^*].$$

$\mathbf{P}_N$ is the prior-assumed, diagonal-normalized autocovariance matrix of the 
excitation process which is the structural innovation in the autoregressive 
data model. The least-squares analysis also obtains estimates of the 
excitation scale-variance, 

$$\hat{\sigma}_p^2 = \hat{\mathbf{p}}^{\dagger} \cdot \mathbf{P}_N^{-1} \cdot \hat{\mathbf{p}} / (N - 2M + 1), \tag{4c}$$

where the estimated or deconvolved excitation vector, 

$$\hat{\mathbf{p}} = \mathbf{z} - \mathbf{Z} \cdot \hat{\mathbf{b}}. \tag{4d}$$
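
The estimator of equations (4a-d) can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions: the lag matrix is built from rows $n = M, \ldots, N$ of the record, the function names `lag_matrix` and `gls_ar_fit` are hypothetical, and the normalized autocovariance matrix is passed in by the caller (the identity matrix being the white-innovation case).

```python
import numpy as np


def lag_matrix(z, M):
    """Row for sample n holds (z[n-1], ..., z[n-M]); the target is z[n]."""
    N = len(z)
    Z = np.column_stack([z[M - m:N - m] for m in range(1, M + 1)])
    return Z, z[M:]


def gls_ar_fit(z, M, P_N=None):
    """Generalized least squares for a complex AR(M) forecasting operator.

    P_N is the assumed diagonal-normalized innovation autocovariance matrix
    (identity if omitted). Returns (b_hat, p_hat, sigma_p2_hat).
    """
    z = np.asarray(z, dtype=complex)
    Z, target = lag_matrix(z, M)
    rows = len(target)
    if P_N is None:
        P_N = np.eye(rows)
    # W = P_N^{-1} [Z, z], computed by solving rather than inverting P_N.
    W = np.linalg.solve(P_N, np.column_stack([Z, target]))
    ZhW = Z.conj().T @ W
    b_hat = np.linalg.solve(ZhW[:, :M], ZhW[:, M])
    p_hat = target - Z @ b_hat
    # rows - M equals N - 2M + 1 when the record is indexed n = 0, ..., N.
    dof = rows - M
    sigma2 = (p_hat.conj() @ np.linalg.solve(P_N, p_hat)).real / dof
    return b_hat, p_hat, sigma2


# Usage on a synthetic, excited AR(1) path (all values hypothetical).
rng = np.random.default_rng(1)
b_true = 0.99 * np.exp(1j * 0.075)
p = rng.normal(size=300) + 1j * rng.normal(size=300)
z_demo = np.zeros(300, dtype=complex)
for n in range(1, 300):
    z_demo[n] = b_true * z_demo[n - 1] + p[n]
b_hat, p_hat, sigma2 = gls_ar_fit(z_demo, M=1)
print(b_hat, sigma2)
```

Solving the linear systems directly, rather than forming the inverse of the autocovariance matrix, is a numerical convenience and not something the text prescribes.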

If we were to make the a priori assumption that the excitation process, 
equivalently the structural innovation of the linear data model, is stationary 
and uncorrelated, that is $\mathbf{P}_N = \mathbf{I}$, the identity matrix, this solution would 
reduce to the so-called “exact least-squares solution” to the autoregressive 
data modelling problem which was introduced to geophysics by Ulrych and 
Clayton (1976). Our understanding of the Chandler wobble suggests that it 
would be much more appropriate to assume that the structural innovation 
has a flicker-noise form. This will allow for the greatest possible weighting 
of long periods in the excitation process while retaining true stationarity 
and a desired fractal character. In the relatively short records which have 
so far been compiled, the long periods in the excitation appear as a slow 
trend. Practically, we are forced to assume that the process is bandlimited 
in a range of frequencies $f_l < |f| < f_u$ where the upper limit is, at least, 
no greater than the Nyquist frequency, $f_{Nyq} = 1/(2\Delta t)$, and the lower limit 
corresponds to a period sufficiently longer than the duration of the record 
under analysis. In the absence of any knowledge to the contrary, we further 
assume that the real and imaginary components of the structural innovation 
are not cross-correlated; then the Toeplitz autocovariance matrix, $\mathbf{P}_N$, will 
be real-valued. 
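
One way to picture the prior matrix for a bandlimited flicker process is sketched below in plain Python. It assumes a power spectrum proportional to $1/|f|$ between a lower and an upper cutoff, so that the normalized autocorrelation at lag $k$ is the integral of $\cos(2\pi f k \Delta t)/f$ over the band divided by $\ln(f_u/f_l)$. The trapezoid rule in $\ln f$, the function names and the 100-year lower cutoff are illustrative assumptions; the text does not say how the matrix was actually computed.

```python
import math


def flicker_autocorrelation(n_lags, dt, f_low, f_high=None, n_grid=4000):
    """Normalized autocorrelation rho[k] of bandlimited 1/|f| noise.

    With u = ln f the integral becomes that of cos(2*pi*exp(u)*k*dt) du,
    evaluated here by the trapezoid rule on a uniform grid in u.
    """
    if f_high is None:
        f_high = 1.0 / (2.0 * dt)           # Nyquist frequency
    u0, u1 = math.log(f_low), math.log(f_high)
    du = (u1 - u0) / n_grid
    rho = []
    for k in range(n_lags):
        total = 0.0
        for j in range(n_grid + 1):
            f = math.exp(u0 + j * du)
            weight = 0.5 if j in (0, n_grid) else 1.0
            total += weight * math.cos(2.0 * math.pi * f * k * dt)
        rho.append(total * du / (u1 - u0))
    return rho


def toeplitz(rho):
    """Real symmetric Toeplitz matrix with entries rho[|i - j|]."""
    n = len(rho)
    return [[rho[abs(i - j)] for j in range(n)] for i in range(n)]


# Hypothetical band: 5-day sampling, lower cutoff at a 100-year period.
rho = flicker_autocorrelation(n_lags=6, dt=5.0, f_low=1.0 / 36500.0)
P_N = toeplitz(rho)
print([round(r, 3) for r in rho])
```

The slow decay of the correlation with lag is what distinguishes this prior from the identity matrix of the white-innovation case.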

What we have essentially described in equations (4a-d), above, is a 
direct least-squares solution for a complex-valued time series modelled as an 
ARIMA($M$, 1/2, 0) process. The equivalent half-order differencing (Granger 
and Joyeux, 1980) arises directly through the substitution of the 
minimum-variance, correlated flicker noise in replacement of the commonly assumed 
purely random innovation. In problems where the data sets are either much 
longer or perhaps less valuable and unique than those being discussed here, 
the direct use of half-order differencing and standard autoregressive 
estimation might suffice. 

Using the theory described so far, Jensen and Mansinha (1984) analysed 
the BIH pole-path record of raw 5-day means during the period 1967.27 to 
1981.73 in an attempt to better determine the period and damping time 
constant of the Chandler wobble and to obtain, by deconvolution, the exci- 
tation pole path for its interpretation in terms of the known geophysical and 
geological events which had been recorded during this epoch. They com- 
pared inversions of the data set based upon the flicker-noise and white-noise 
assumptions. AR model order $M = 4$ was indicated for this data set by the 
Akaike (1969, 1974) final-prediction error criterion applied to the classical 
model with a purely random innovation. The same model order was used 
in the flicker-noise inversions. The flicker inversion determined one signifi- 
cant long-period component with a period corresponding to 425 days and a 
Q = 32, which is equivalent to a damping time constant of about 12 years. 
This is the Chandler resonance. 
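
The quoted equivalence between $Q$ and the damping time constant can be checked with a short calculation. The Python sketch below assumes the convention $Q = \omega_0 \tau / 2 = \pi \tau / P$ for the amplitude decay time; that convention is inferred from the figures quoted above and is not stated explicitly in the text, and the function name is hypothetical.

```python
import math


def damping_time_from_q(q, period_days):
    """Amplitude damping time in days, assuming Q = pi * tau / P."""
    return q * period_days / math.pi


tau_days = damping_time_from_q(32.0, 425.0)
tau_years = tau_days / 365.2422
print(round(tau_days), round(tau_years, 2))
```

Under this assumed convention a period of 425 days and $Q = 32$ give a time constant a little under 12 years, in line with the figure quoted.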

Much of the geophysical interest in the phenomenon of the Chandler 
wobble concerns its damping time constant since a long time constant (very 
high Q) would allow for its excitation to observed levels with low power. It is 
difficult to account for the excitation of the wobble by any known phenomena 
if its Q is less than, perhaps, 30 (Lambeck, 1980). Jensen and Mansinha 
(1984) introduced their data modelling theory in the hope of defining the 
high Q of the Chandler resonance. 

The standard AR(4) model with a solution obtained using the Ulrych 
and Clayton (1976) direct least-squares method resolved the Chandler 
resonance ($P = 429$ days, $Q = 28$) but also determined a substantial, and 
geophysically unaccountable, resonance ($P = -353$ days, $Q = 0.43$) 
representing a wobble in the negative rotation sense. Jensen and Mansinha (1984) 
recognized that this false wobble resonance was the result of an attempt by 
the classical AR data model to account for the preponderance of long periods 
in the data set as a property of the autoregression operator even though they 
are properly due to the properties of the stochastic excitation phenomenon. 
The classical AR model had, by implication, ascribed false character to the 
geophysical system. Above, we showed that the Chandler resonance, which 
should rest as the only significant period in these data following annual and 
mean reduction, can be described by a purely autoregressive forecasting 
operator of order $M = 1$. Any other periods which extend the order of the 
required best operator are false providing the description of our data model 
is complete. Unfortunately, a major component of the BIH pole-path data 
has not been included in the theory so far described here and by Jensen and 
Mansinha (1984). We have not accounted for additive measurement errors; 
we have described the model appropriate to a measurement-noise-free data 
set. 

Measurement noise in the data is revealed by the classical and 
flicker-noise AR models, just described, as equivalent low-Q resonances at shorter 
periods. The classical AR(4) model found two resonances to fill in the 
spectrum of the data set: $P = 105$ days, $Q = 0.66$ and $P = -148$ days, $Q = 0.44$. 
The flicker-innovation AR(4) model found two similar resonances: $P = 85$ 
days, $Q = 1.2$ and $P = -135$ days, $Q = 1.2$. The fourth period not yet 
accounted for in the AR(4) flicker-innovation model was found to be 
insignificant with a root magnitude of only 0.06. It is clear that resonances of 
larger amplitude are assigned by the flicker-innovation solution to account 
for measurement error. This is because the flicker innovation is relatively 
deficient in short period spectral components as compared to a purely 
random innovation. Consequently, the AR model operator must further amplify 
these short periods in comparison to the white-innovation operator in order 
to fill in the spectrum of the data set. Jensen and Mansinha (1984) 
recognized that both modelling procedures were incomplete because they could 
not account for a separately additive measurement error. They proposed 
augmentation of the model described by equations (3a,b) as follows. 



6. THE ADDITIVE NOISE-AUGMENTED/AUTOREGRESSIVE 
DATA MODEL 

Only the signal component of the pole-path data must, by geophysical 
theory, follow the form of a structural autoregressive data model. The 
measurement errors in the data are properly and separately additive. That is, 
the signal may be described by the AR model as follows: 

$$s_n = b_n * s_n + p_n,$$

where the signal, $s_n$, is the difference between the observation, $z_n$, and the 
measurement error, $\epsilon_n$: 

$$s_n = z_n - \epsilon_n.$$

If, now, we reform the “apparent structural innovation” as 

$$q_n = p_n + \epsilon_n - b_n * \epsilon_n \tag{5a}$$

$$\phantom{q_n} = p_n + \gamma_n * \epsilon_n, \tag{5b}$$

we may employ an AR model for the data comprising both signal and 
additive noise as follows: 

$$z_n = b_n * z_n + q_n. \tag{6}$$

We expect that the measurement error is stationary and uncorrelated, with 
zero mean and known variance, $\sigma_\epsilon^2$. We believe that the excitation pole path 
is also stationary, with zero mean, variance, $\sigma_p^2$, and is self-correlated like 
flicker noise. 

Requiring the autocorrelation function of the excitation pole path 

$$\phi_{p,k} = E[p_n \cdot p_{n+k}^*],$$

to follow a flicker noise model, we determine that the autocorrelation 
function of the apparent innovation described by equations (5a,b) has the form 

$$\phi_{q,k} = \phi_{p,k} + \sigma_\epsilon^2 \sum_{m=0}^{M} \gamma_m \gamma_{m+k}^*, \qquad k = 0, 1, \ldots, M,$$

$$\phi_{q,k} = \phi_{p,k}, \qquad k > M.$$
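
The dependence of this autocorrelation on the regression operator can be made concrete with a short Python sketch. It assumes that the measurement-error term is the autocorrelation of $\gamma_n * \epsilon_n$ for white $\epsilon_n$ of variance $\sigma_\epsilon^2$, with $\gamma_0 = 1$, $\gamma_m = -b_m$ for $m = 1, \ldots, M$ and $\gamma_m = 0$ beyond lag $M$. The function name and the test values are hypothetical.

```python
def apparent_innovation_autocorr(phi_p, b, sigma_e2):
    """phi_q[k] = phi_p[k] + sigma_e2 * sum_m gamma[m] * conj(gamma[m + k]).

    phi_p    : autocorrelation of the excitation at lags 0, 1, 2, ...
    b        : forecasting operator coefficients b_1, ..., b_M
    sigma_e2 : variance of the white measurement error (assumed known)
    """
    gamma = [1.0 + 0.0j] + [-complex(bm) for bm in b]
    M = len(b)
    phi_q = []
    for k, phi in enumerate(phi_p):
        extra = 0.0
        if k <= M:
            # Terms with m + k > M vanish because gamma is zero there.
            extra = sum(gamma[m] * gamma[m + k].conjugate()
                        for m in range(M - k + 1))
        phi_q.append(phi + sigma_e2 * extra)
    return phi_q


# Hypothetical check: white excitation, M = 1, b_1 = 0.5, sigma_e2 = 2.
phi_q = apparent_innovation_autocorr([1.0, 0.0, 0.0], [0.5], 2.0)
print(phi_q)
```

Only lags up to $M$ are altered, which is why the correction to the covariance matrix is confined to a narrow band around the diagonal.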

Note that this autocorrelation function, from which the variance-covariance 
matrix required in a least-squares solution of the problem must be formed, 
depends upon the regression operator which is the subject of the inversion. 
A straightforward iterative procedure for obtaining this operator is now 
described. 

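Normalizing the complex-valued, Toeplitz variance-covariance matrix of 
the apparent innovation as follows, 

$$\mathbf{Q} = \sigma_p^2\, \mathbf{Q}_N, \tag{7a}$$

we solve for a temporary estimate of the regression vector, 

$$\hat{\mathbf{b}} = (\mathbf{Z}^{\dagger} \cdot \mathbf{Q}_N^{-1} \cdot \mathbf{Z})^{-1} \cdot (\mathbf{Z}^{\dagger} \cdot \mathbf{Q}_N^{-1}) \cdot \mathbf{z}, \tag{7b}$$

which minimizes the variance of the excitation power estimate, 

$$\hat{\sigma}_p^2 = \hat{\mathbf{q}}^{\dagger} \cdot \mathbf{Q}_N^{-1} \cdot \hat{\mathbf{q}} / (N - 2M + 1), \tag{7c}$$

and which essentially minimizes the variance of the apparent innovation 
determined by deconvolution, 

$$\hat{\mathbf{q}} = \mathbf{z} - \mathbf{Z} \cdot \hat{\mathbf{b}}. \tag{7d}$$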

Knowing the error variance, we recompute the variance-covariance matrix 
and solve again for the estimated regression vector and excitation variance 




GEOPHYSICAL SYSTEMS WITH FRACTAL FLICKER NOISE 



181 



and deconvolve for a new apparent innovation vector. We iterate until the 
excitation variance converges. 

For the problem solved here, we initialized the inversion with the solu- 
tion to the measurement error-free model described above and terminated 
iterations when the continuously decreasing estimate of the excitation vari- 
ance changed by less than 1 part in 10^ between successive steps. Then, 
reinitializing the inversion with an estimate of the excitation variance 1 part 
in 100 lower than that just found, and using the regression vector just found, 
we continued iteration until the continuously increasing estimate of the exci- 
tation variance changed again by less than 1 part in 10^ between successive 
steps. For a flicker noise excitation path, this procedure appears to be ro- 
bust; the eventual solution is quite unaffected by starting conditions and 
is insensitive to slight truncations of the pole path record. On the other 
hand this inversion procedure is not stable, often diverging after four or five 
iterations, when both the innovation and additive measurement error are 
assumed to be uncorrelated. 
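
The iteration can be outlined in Python with NumPy as follows. This is a simplified sketch and not the procedure actually run for the results reported here: it performs a single fixed-point loop without the restart from a lowered variance, it assumes the normalization $\mathbf{Q}_N = \mathbf{P}_N + (\sigma_\epsilon^2/\sigma_p^2)\,\mathbf{\Gamma}$ with $\mathbf{\Gamma}$ the Toeplitz matrix built from the autocorrelation of $\gamma_m$, and the function names, tolerance and iteration cap are hypothetical.

```python
import numpy as np


def toeplitz_from_acf(acf, n):
    """Hermitian Toeplitz matrix with T[i, j] = acf[j - i] for j >= i."""
    full = np.zeros(n, dtype=complex)
    full[:min(n, len(acf))] = acf[:n]
    idx = np.arange(n)
    lag = idx[None, :] - idx[:, None]
    return np.where(lag >= 0, full[np.abs(lag)], full[np.abs(lag)].conj())


def noise_augmented_fit(z, M, rho_p, sigma_e2, max_iter=50, rtol=1e-6):
    """Iterate the generalized least-squares step with an updated Q_N.

    rho_p    : normalized autocorrelation of the excitation (lag 0 first)
    sigma_e2 : assumed known variance of the white measurement error
    """
    z = np.asarray(z, dtype=complex)
    N = len(z)
    Z = np.column_stack([z[M - m:N - m] for m in range(1, M + 1)])
    target = z[M:]
    rows = N - M
    P_N = toeplitz_from_acf(np.asarray(rho_p, dtype=complex), rows)
    Q_N = P_N.copy()                  # start from the error-free model
    sigma_p2 = None
    for _ in range(max_iter):
        W = np.linalg.solve(Q_N, np.column_stack([Z, target]))
        b = np.linalg.solve(Z.conj().T @ W[:, :M], Z.conj().T @ W[:, M])
        q = target - Z @ b
        new_s2 = (q.conj() @ np.linalg.solve(Q_N, q)).real / (rows - M)
        done = sigma_p2 is not None and abs(new_s2 - sigma_p2) <= rtol * sigma_p2
        sigma_p2 = new_s2
        if done or sigma_e2 == 0.0:
            break
        # Rebuild Q_N from the current operator gamma = (1, -b_1, ..., -b_M).
        gamma = np.concatenate(([1.0], -b))
        acf_g = np.array([np.sum(gamma[:M + 1 - k] * gamma[k:].conj())
                          for k in range(M + 1)])
        Q_N = P_N + (sigma_e2 / sigma_p2) * toeplitz_from_acf(acf_g, rows)
    return b, sigma_p2
```

With the measurement-error variance set to zero the loop stops after one pass and returns the ordinary generalized least-squares estimate, which gives a simple consistency check on an implementation.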



7. POLE PATH SPECTRA 

The mean- and annual-reduced BIH pole-path record of raw 5-day means 
(1978-82) comprising 366 points and used in these analyses is shown in 
Figure 3. A classical AR data model, computed using the Ulrych-Clayton 
(1976) algorithm modified for complex-valued data (equations 4a-d with 
$\mathbf{P}_N = \mathbf{I}$), showed a minimum final-prediction error (Akaike, 1969) at model 
order $M = 5$. Since we have not yet extended the FPE or AIC criteria 
for model selection to allow for generally self-correlated innovations or for 
the additive noise-augmented AR model, we have used model order $M = 5$ 
throughout our several solutions following. Factoring the AR(5) operator, 
we may determine its complex-valued roots or zeros. Each root represents 
a damped harmonic component of the system which is stimulated by the 
excitation pole. We are especially interested in that component which cor- 
responds to the Chandler wobble. 
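
The factoring step can be sketched in Python with NumPy. The sketch assumes that each root $r$ of the polynomial $x^M - b_1 x^{M-1} - \cdots - b_M$ is written as $r = e^{i\omega \Delta t}$ with $\omega = \omega_0 + i/\tau$, and that $Q = \pi \tau / |P|$; the sign of the period then distinguishes the two senses of rotation. The function name and the $Q$ convention are assumptions made for illustration.

```python
import numpy as np


def ar_resonances(b, dt_days):
    """Return (period_days, tau_days, Q, |root|) for each root of the operator.

    b holds the forecasting coefficients b_1, ..., b_M. A negative period
    marks a component rotating in the negative sense.
    """
    b = np.asarray(b, dtype=complex)
    # Characteristic polynomial x^M - b_1 x^(M-1) - ... - b_M.
    roots = np.roots(np.concatenate(([1.0], -b)))
    out = []
    for r in roots:
        omega0 = np.angle(r) / dt_days
        period = 2.0 * np.pi / omega0 if omega0 != 0 else np.inf
        tau = -dt_days / np.log(abs(r))
        q = np.pi * tau / abs(period)
        out.append((period, tau, q, abs(r)))
    return out


# Hypothetical single resonance: P = 425 days, Q = 32, 5-day sampling.
P, Q, dt = 425.0, 32.0, 5.0
b1 = np.exp(1j * (2.0 * np.pi / P + 1j * np.pi / (Q * P)) * dt)
print(ar_resonances([b1], dt))
```

A root of small magnitude, such as the value 0.06 quoted earlier for the fourth root of the flicker-innovation AR(4) model, corresponds to a very short damping time and hence to a component of little significance.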

Figure 4 shows the conventional AR model spectrum of the BIH pole- 
path record obtained in the usual way by division of the squared Fourier 
transform of the computed regression operator, $\gamma_m$, $m = 0, 1, \ldots, M$, into 
a white spectrum scaled by the computed innovation variance. The very 
sharp resonance ($Q \approx 69$) of the Chandler wobble at 0.86 cycles per year 
(period 425 days) is the most evident feature of the spectrum. Geophysical 
theory predicts no other visible resonances in the spectrum; the background 
level corresponds to the BIH-reported level of measurement error in the 
data set. Because the innovation assumed in this data is not sufficiently 
[Figure 3 plot, titled “BIH POLE PATH (1978-1982)”; axis label: X, 0.″1, Greenwich Meridian.] 

Figure 3. The BIH pole path record (1978-82) following its mean and 
annual component reduction. The positive-phase sense of rotation is 
counter-clockwise. Straight lines join the data points which are separated by 5-day 
intervals. 

rich in low-frequency composition, the AR operator itself has been forced to 
account for the low-frequencies in the data set by exaggeration of the long- 
period Chandler resonance. We believe that this is the major reason for the 
otherwise attractively high Q found for the resonance. Moreover, this data 
model has not been found to be robust since the resonance frequency and Q 
obtained is quite sensitive to the removal of a few points from the beginning 
or end of the data set. For example, removing the last 10 points of 366 
results in a reduced $Q \approx 54$ and increased resonance period while removing 
the first 10 points results in an increased $Q \approx 73$ and period. Finally, we 
know that the data model is incomplete and consequently, we have little 
confidence that the results reported here are geophysically meaningful. 

Figure 5 shows the equivalent AR spectrum under the assumption that 
the innovation process has a flicker noise form. Here, both the zero-frequency 
Figure 4. The standard AR(5) wobble spectrum of the BIH data set of 
Figure 3 as computed by the Ulrych-Clayton “exact least squares method”. The 
measurements are assumed to arise from the excitation of the 
autoregressive operator by a purely random, minimum-variance process. No additive 
measurement error is considered. 




Figure 5. The spectrum of the modelled flicker-noise excitation or innova- 
tion. Here, the excitation has been scaled to a variance of 10 x arcsec^. 
singularity of the assumed flicker excitation and the sharp ($Q \approx 10$), 
symmetrical Chandler resonance at 0.87 cycles per year are resolved. This data 
model allows for a lower Q resonance because the regression operator ob- 
tained via equations (4a-d) does not have to account for a preponderance 
of very long period composition. The known level of measurement error in 
the data set, reconstructed by the scaling of the flicker innovation by the 
remaining four zeros of the calculated AR(5) operator, is somewhat overes- 
timated. Figure 6 shows the spectrum of the prior-assumed flicker excitation 
which is the structural innovation for the latter data model. Results similar 
to those shown here were reported by Jensen and Mansinha (1984) in their 
flicker-noise AR modelling of a different data set. 

The new data model, described above (equations 6, 5a,b), accounts for 
a self-correlated structural innovation to the autoregression operator and a 
separately additive, stationary and uncorrelated measurement error. This 
model closely corresponds to our current geophysical understanding of the 
BIH data set. Assuming a flicker innovation to allow for the expected pre- 
ponderance of long period components in the excitation spectrum and a 
measurement error with the known variance, the spectrum shown in 
Figure 7 was calculated using the newly-elaborated algorithm (equations 7a-d). 
The spectrum derived from this composite model is determined as the sum 
of a white spectrum corresponding to the known level of measurement error 
and the spectrum obtained by division of the squared Fourier transform of 
the calculated regression operator into the variance-scaled $1/|f|$-spectrum. 
Essentially, only the zero-frequency singularity of the flicker noise innovation 
and the Chandler resonance exceed the spectral density of the white 
measurement error background. Our attempt to use this same data model under 
the assumption that the innovation was uncorrelated failed numerically after 
four or five iterations. 
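
The composite spectrum can be written down compactly. The Python sketch below assumes the form $S(f) = \sigma_\epsilon^2 + \sigma_p^2 / (|f|\,|\Gamma(f)|^2)$ with $\Gamma(f) = 1 - \sum_m b_m e^{-2\pi i f m \Delta t}$, and it leaves out the normalizing constants (sampling interval and bandwidth of the flicker band), so only relative levels are meaningful. The function name and all numerical values are hypothetical.

```python
import cmath
import math


def composite_spectrum(f, b, dt, sigma_p2, sigma_e2):
    """White measurement-error floor plus a flicker-excited AR spectrum.

    f is in cycles per unit time (f != 0, |f| <= 1/(2*dt)); b holds the
    coefficients b_1, ..., b_M. Scale factors are deliberately omitted.
    """
    gamma_f = 1.0 - sum(bm * cmath.exp(-2j * math.pi * f * m * dt)
                        for m, bm in enumerate(b, start=1))
    return sigma_e2 + sigma_p2 / (abs(f) * abs(gamma_f) ** 2)


# Hypothetical single resonance at 425 days with a 4329-day damping time.
f0, tau, dt = 1.0 / 425.0, 4329.0, 5.0
b1 = cmath.exp(1j * complex(2.0 * math.pi * f0, 1.0 / tau) * dt)
levels = [composite_spectrum(f, [b1], dt, 1.0, 0.05)
          for f in (0.5 * f0, f0, 2.0 * f0)]
print(levels)
```

Positive frequencies here correspond to the positive sense of rotation, so a resonance of this kind appears on one side of the spectrum only.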



8. GEOPHYSICAL INTERPRETATION 

The noise-augmented, flicker-noise excited AR model decomposition 
determines only one period in the BIH record which has a power density 
significantly in excess of that of the known level of measurement error. This 
component, with a period of 415 days and a $Q \approx 21$, is obviously the Chandler 
resonance. It is interesting to note that this analysis allows for a shorter 
period of the Chandler resonance than is normally found using the standard 
procedures. Unfortunately, the Q appears to be too low. This latter failure 
is due, at least in part, to the unique character of this data set. Begin- 
ning about 1980, the record (Figure 3) shows an evidently rapid decrease 
in amplitude of the resonance. This could be due to a lack of excitation 
Figure 6. The AR(5) wobble spectrum of the BIH data set of Figure 3 
assuming a minimum-variance, flicker noise innovation. No additive 
measurement error is considered. 




Figure 7. The AR(5) wobble spectrum of the BIH data set of Figure 3 
assuming a minimum-variance flicker noise innovation in the presence of 
additive white noise representing the measurement error with variance 50 X 
10“® arcsec^ which approximates the average level of standard error in the 
BIH data set. 
of the wobble during this epoch. If this is the case, we are observing the 
actual free decay of the resonance. Equally well, the wobble’s collapse could 
have resulted from a large asynchronous excitation which partially cancelled 
the wobble by interference. Moreover, one cannot discount the possibility 
that this apparent collapse of the wobble is only an artifactual result of 
the method employed in reducing the mean and annual components of the 
record. To resolve which of these possibilities holds, we require a longer and 
continually homogeneous record. A longer record (to December 30, 1984) 
has already been published by the BIH in its Annual Reports for the year 
1984. However, the continuing rapid improvement in the pole-position mea- 
surement technology has allowed the BIH to reduce the standard errors in 
measurement by a factor of 2 since the beginning of the 1978-82 data set 
used here. Our present method presumes stationarity and non-correlation 
of the additive measurement error. Adequate analysis of the BIH’s recently 
published extended record will require further elaboration of the method 
to allow for non-stationarity of the errors. Presently, the standard errors of 
measurement represent about 1 part in 200 or so referenced to the magnitude 
of the pole position. If these errors could be further reduced by an order of 
magnitude, the noise-augmented data model would not be required in order 
to determine the Chandler resonance period and quality unequivocally. 

9. CONCLUSIONS 



We have presented, by means of a single geophysical example, an ar- 
gument for an elaborated structural data model which we believe to be 
appropriate for a wider class of problems in time series analysis. The geo- 
physicist or natural scientist can almost always draw upon his understanding 
and theoretical description of nature in construction of an appropriate 
structural data model. Not all time series analysts are so fortunate in their 
problems. Economic systems, for example, are evidently extremely complex, 
time variant and non-linear. No sufficient theory exists to describe almost 
any econometric time series and consequently one cannot hope to employ 
adequately elaborated structural data models in their analysis. Rather, and 
more appropriately, the time series analyst employs arbitrary models which 
he can only identify with respect to an optimum form and order subsequent 
to his analysis. We do not presume to criticize this conventional approach to 
time series analysis. As natural scientists whose major objectives are to un- 
derstand and explain nature, we are grateful for the continuing developments 
in statistical methods which we may adapt and bring to bear in resolution 
of our problems. We are perhaps only attempting to warn ourselves in our 
eagerness to attack our data by some fashionable statistical method without 







first carefully considering whether or not it is appropriate to our problem. 
ON SOME ECF PROCEDURES 
FOR TESTING INDEPENDENCE 

ABSTRACT 

This paper is concerned with the use of the empirical characteristic func- 
tion (ecf) in nonparametric testing for independence. Properties of the ecf 
are briefly reviewed, and a new distributional convergence result (Theorem 
2.3) included. Nonparametric testing for independence is discussed briefly, 
but with particular focus on asymptotic aspects. Some new procedures for 
testing independence based on the ecf are presented and developed, and a 
Monte Carlo study carried out. The asymptotic efficiency of the procedure 
is discussed and suggestions for further work and some open problems noted. 

1. INTRODUCTION AND SUMMARY 

In this paper we explore the use of certain characteristic function (cf) 
based quantities and their empirical estimates in the context of nonparamet- 
ric testing for independence in multivariate i.i.d. data. The key underlying 
idea is that since independence can be characterised by factorizability of a 
cf into the product of its marginals, the empirical cf (ecf) should provide an 
effective tool for testing the hypothesis $H_0$ of independence. 

If $X_j = (X_j^1, \ldots, X_j^p)'$, $j = 1, \ldots, n$ is an i.i.d. sample from a $p$-variate 
distribution whose cf is $c(t) = E e^{it'X_j}$, where $t = (t_1, \ldots, t_p)'$, then the ecf 
is defined as 

$$c_n(t) = \frac{1}{n} \sum_{j=1}^{n} e^{it'X_j}.$$



The properties of this estimator are now fairly well understood and for con- 
venience of the reader Section 2 below provides a brief review of certain key 
results. (Theorem 2.3, however, is new.) In Section 3 we outline very briefly 
the current status of the nonparametric independence testing problem giv- 
ing particular focus to asymptotic aspects. Our ecf based tests are proposed 
and developed in Section 4, and a Monte Carlo study is presented in Section 
5. Section 5 also includes a discussion on efficiency of the tests and also 
itemizes some suggestions for further work. 



2. THE ECF 

Let $X_j = (X_j^1, \ldots, X_j^p)'$, $j = 1, \ldots, n$ be a sample of a $p$-variate 
distribution having cf $c(t)$ and define the ecf as $c_n(t) = \frac{1}{n} \sum_{j=1}^{n} e^{it'X_j}$, where 
$t = (t^1, \ldots, t^p)'$. The properties of $c_n(t)$ have been explored extensively 
by Feuerverger and Mureika (1977), Csorgo (1981a,b), Feuerverger and 
McDunnough (1981a,b), and references appearing in these papers, for 
example. It is easily seen that $c_n(t)$ is an average of $n$ independent processes 
of the type $e^{it'X_j}$ and has mean $E[c_n(t)] = c(t)$. Defining the ecf process 
$Y_n(t) = \sqrt{n}\,(c_n(t) - c(t))$, the full covariance structure follows from the 
relation 

$$\mathrm{cov}(Y_n(s), Y_n(t)) = E\, Y_n(s) \overline{Y_n(t)} = c(s - t) - c(s)c(-t), \tag{2.1}$$

and from the fact that $Y_n(-t) = \overline{Y_n(t)}$. The following theorem records the 
three basic and successively stronger consistency results. 
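
As a concrete illustration of the definition, the ecf of a multivariate sample can be evaluated directly. The Python sketch below uses only the standard library; the function name `ecf`, the bivariate normal test sample and the evaluation point are illustrative choices.

```python
import cmath
import random


def ecf(sample, t):
    """c_n(t) = (1/n) * sum_j exp(i * t'X_j) for p-variate observations."""
    n = len(sample)
    return sum(cmath.exp(1j * sum(tk * xk for tk, xk in zip(t, x)))
               for x in sample) / n


# Hypothetical sample: 500 independent bivariate standard normal pairs.
random.seed(1)
sample = [(random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)) for _ in range(500)]
t = (0.7, -0.4)

joint = ecf(sample, t)
product = ecf(sample, (t[0], 0.0)) * ecf(sample, (0.0, t[1]))
print(abs(joint - product))   # small when the components are independent
```

The last line compares the joint ecf with the product of its marginal ecfs at one point, which is the factorization property on which tests of independence based on the ecf rest.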

Theorem 2.1 

(i) $c_n(t) \to c(t)$ a.s. for all $t$. 

(ii) $\sup_{\|t\| \le T} |c_n(t) - c(t)| \to 0$ a.s. for $T$ finite. 

(iii) In (ii) we may replace $T$ by $T_n = \exp(o(n))$ and this result is the best 
possible, in general. 

Results (i) and (ii) are given by Feuerverger and Mureika (1977). Result 
(iii) for various suboptimal rates for $T_n \uparrow \infty$ is due to Feuerverger and 
Mureika (1977) and Csorgo (1981a,b). The sharp result is given by, for 
example, Csorgo and Totik (1983). 

The asymptotic distributional characteristics are given by the following 
theorem in which four successively more general results are provided. 

Theorem 2.2. Let $Y(t)$ be a zero mean complex valued Gaussian process 
with covariance structure identical to (2.1) and such that $Y(-t) = \overline{Y(t)}$. 
Then 

(i) $Y_n(t_1), Y_n(t_2), \ldots, Y_n(t_k)$ converges in distribution to 
$Y(t_1), Y(t_2), \ldots, Y(t_k)$ for all $k$ and $t_1, t_2, \ldots, t_k \in R^p$. 

(ii) $Y_n(t)$ converges weakly to $Y(t)$ over any compact set in $R^p$ provided 
$E\|X_1\|^{1+\delta} < \infty$ for some $\delta > 0$. 

(iii) In (ii) the moment restrictions may be weakened to 

$$E[\log^+ \|X_1\| \,(\log^+ \log^+ \|X_1\|)^{1+\delta}] < \infty,$$

for some $\delta > 0$. 

(iv) $Y_n(t)$ converges weakly to $Y(t)$ over compact sets if and only if 

$$\int_0 \frac{\bar\sigma(h)}{h (\log 1/h)^{1/2}}\, dh < \infty,$$

where $\bar\sigma(h)$ is a non-decreasing rearrangement of the function 
$\sigma(t) = (1 - \mathrm{Re}\, c(t))^{1/2}$. Specifically, if 

$$m(y) = \lambda_d\{t : \|t\| < 1/2,\ \sigma(t) < y\}, \qquad 0 < y < 1,$$

where $\lambda_d$ is $d$-dimensional Lebesgue measure, $\bar\sigma$ is its inverse: 

$$\bar\sigma(h) = \sup\{y : m(y) < h\}.$$

Results (i) and (ii) are given by Feuerverger and Mureika (1977). Results 
(iii) and (iv) lie deeper. The logarithmic moment condition is given by Keller 
(1979) and is nearly sharp. The sharp result is from the work of Csorgo 
(1981a,b) and Marcus (1981). Note $\bar\sigma(h)$ is defined on $[0, m(1)]$ and has 
there the same distribution with respect to $\lambda_1$ as does $\sigma(t)$ on $\|t\| < 1/2$ with 
respect to $\lambda_d$. Feuerverger and McDunnough (1981a) have observed that 
actual weak convergence of the ecf is generally not required in statistical 
contexts. Indeed, results of the following type may be established. 

Theorem 2.3. The finite expression 

$$A_0 + \int_{R^p} Y_n(t)\, dA_1(t) + \int_{R^{2p}} Y_n(t_1) Y_n(t_2)\, dA_2(t_1, t_2) + \cdots + \int_{R^{kp}} Y_n(t_1) \cdots Y_n(t_k)\, dA_k(t_1, \ldots, t_k), \tag{2.2}$$

where each of the functions $A_l(t_1, \ldots, t_l)$ is of bounded variation on $R^{lp}$, 
$l = 1, 2, \ldots, k$, converges in distribution to the expression obtained from (2.2) by 
replacing the terms $Y_n(\cdot)$ by the Gaussian process $Y(\cdot)$. 

Proof. Denote the expression in (2.2) by $Q(Y_n)$ and let $Y_n^M$, $Y^M$ be $Y_n$ 
and $Y$ except based on truncating the $X$ variable at $\pm M$. Then for fixed 




192 



ANDREY FEUERVERGER 



$M < \infty$, $Q(Y_n^M) \to Q(Y^M)$ by Theorem 2.2. We may also readily establish 
that $Q(Y^M) \to Q(Y)$ as $M \to \infty$ and finally that $Q(Y_n^M) \to Q(Y_n)$ as 
$M \to \infty$ uniformly in $n$ giving $Q(Y_n) \to Q(Y)$ as required. 

Turning to statistical contexts, a significant motivation for the study of 
ecf procedures is the efficiency result for inference based on the ecf due to 
Feuerverger and McDunnough (1981a, b). This result essentially is as follows. 

Theorem 2.4. Suppose $X_j = (X_j^1, \ldots, X_j^p)'$, $j = 1, 2, \ldots, n$ are i.i.d. with 
cf $c_\theta(t)$ where $t = (t^1, \ldots, t^p)'$, and $\theta = (\theta_1, \ldots, \theta_q)'$ takes on some true value 
$\theta_0$ in $\Theta \subset R^q$. Let $t_1, t_2, \ldots, t_k$ be a fixed grid of $k$ points in $R^p$ such that 
$t_j \pm t_l \ne 0$ for $j \ne l$ and define the $2k$-dimensional column vectors $Z_\theta$ and 
$Z_n$ as 

$$Z_\theta = (c_\theta(-t_k), \ldots, c_\theta(-t_1), c_\theta(t_1), \ldots, c_\theta(t_k))'$$

and 

$$Z_n = (c_n(-t_k), \ldots, c_n(-t_1), c_n(t_1), \ldots, c_n(t_k))'.$$

Define the matrix $\Sigma = \mathrm{cov}(\sqrt{n}\, Z_n)$ whose entries are given by (2.1) and 
consider estimating $\theta$ by fitting $Z_\theta$ to $Z_n$ using generalized non-linear least 
squares, i.e., minimizing 

$$(Z_n - Z_\theta)'\, \hat\Sigma^{-1} (Z_n - Z_\theta),$$

where $\hat\Sigma$ is a consistent estimator of $\Sigma$. Then under mild conditions that 
ensure asymptotic optimality of the MLE the estimation procedure above is 
consistent, asymptotically normal, and asymptotically has covariance which 
may be made arbitrarily close to the Cramer-Rao bound by selecting the 
grid $\{t_j\}$ to be sufficiently fine and extended. 

The procedure of Theorem 2.4 is termed harmonic regression and is only 
one of several efficient procedures given by Feuerverger and McDunnough 
(1981a,b), where a fuller discussion of regularity conditions may also be 
found. 



3. NONPARAMETRIC TESTS FOR INDEPENDENCE 

In this section we give a very brief overview of what has up to now been 
accomplished in the literature on the nonparametric independence testing 
problem. However, our focus is primarily on asymptotic aspects and some 
related open problems are noted. 

In the context of bivariate normality, optimal inference concerning 
independence is based, of course, on the Pearson product-moment correlation 
coefficient with the resulting procedure being both UMP unbiased and UMP 
invariant (Lehmann, 1959, § 5.11 and problem 6.11). See also Anderson 
(1958, Chapter 4), for example. For departures from normality, it is 
somewhat remarkable that the robustness of the correlation test for bivariate 
independence is still not completely understood, although evidence of 
non-robustness is mounting. See, for example, Srivastava and Lee (1984) and 
related references. 

Many rank tests of independence (or non-association) have been 
proposed, in general for motivations related specifically to robustness 
considerations. If $(X_i, Y_i)$, $i = 1, \ldots, n$ is a bivariate sample, these tests are based 
on the data only through the collection of ranks $(U_i, V_i)$, $i = 1, \ldots, n$ where 
$U_i$ ($V_i$) is the rank of $X_i$ ($Y_i$) amongst the $X$'s ($Y$'s) respectively. 
Well-known among such rank tests are Spearman’s rank-correlation, Kendall’s 
tau, Fisher-Yates’ normal scores and the quadrant test, among others. See, 
for example, Hollander and Wolfe (1973, Chapter 8), Lehmann (1975, 
Chapter 7) and Kendall (1970). Table 1 gives the known asymptotic (Pitman) 
efficiencies of various rank tests of bivariate independence relative to the 
sample correlation coefficient test in the context of the bivariate normal 
family. The full asymptotic efficiency in this case of the normal scores test is 
of special interest. For bivariate families other than the normal, it is known 
that the asymptotic efficiencies relative to the correlation can in general be 
both greater and less than unity, depending upon the particular family. (See, 
for example, Stuart (1954) and Konijn (1956).) 



Table 1. Asymptotic efficiencies of nonparametric tests for association 
relative to correlation in the bivariate normal family. 

Test                          A.R.E. 

Fisher-Yates normal scores    1.00 
Spearman rank correlation      .91 
Kendall tau                    .91 
Hoeffding $B_n$                .78 
Quadrant test                  .41 


The A.R.E. values shown are Pitman efficiencies except for Bn, which is the 
limiting Bahadur efficiency. (Wieand (1976) has given general conditions 
under which these efficiency measures coincide.) 
The nonparametric tests of association so far discussed share one
significant flaw, namely they are not consistent for many typical nonparametric
classes of alternatives. (Indeed, examples can be readily constructed
where the tests are not unbiased or even asymptotically unbiased.) It is the
case, however, that these tests are generally consistent against alternatives
involving dependence structure which is, loosely speaking, of “monotone
character” (e.g., Lehmann, 1966). And since it is this type of dependence
which is characteristic of many typical applications, these tests have found
widespread use. Nevertheless dependencies of less direct character do
sometimes occur, as, for example, when Y has a non-monotone regression on X,
and X is sampled randomly. In any case, the lack of consistency is, at least
from a theoretical standpoint, unsatisfying.
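
The point about non-monotone dependence can be illustrated numerically. The following is a minimal sketch in Python (standard library only); the helper names `midranks` and `spearman` and the toy data are assumptions made for illustration and do not come from the text. It computes Spearman's rank correlation for a sample in which Y is an exact function of X, once monotone and once non-monotone.

```python
def midranks(values):
    """Return 1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Pearson correlation of the (mid)ranks of xs and ys."""
    u, v = midranks(xs), midranks(ys)
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    suv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    suu = sum((a - mean_u) ** 2 for a in u)
    svv = sum((b - mean_v) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5


xs = [k / 10.0 for k in range(-20, 21)]             # symmetric grid on [-2, 2]
rho_monotone = spearman(xs, [x ** 3 for x in xs])   # monotone dependence
rho_parabola = spearman(xs, [x ** 2 for x in xs])   # non-monotone dependence
```

For the symmetric parabola the rank correlation vanishes even though Y is completely determined by X, which is the kind of alternative against which a rank-correlation test cannot be consistent.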

There are, however, two exceptions in the literature of which we are
aware. The first is the distribution-free (under $H_0$) rank test proposed by
Hoeffding (1948) and by Blum et al. (1961) based upon

$$B_n = \iint [F_n(x,y) - F_n(x)F_n(y)]^2 \, dF_n(x,y),$$

while the second is due to Rosenblatt (1975) and is based upon

$$T_n = \iint [f_n(x,y) - f_n(x)f_n(y)]^2 \, a(x,y) \, dx \, dy,$$

where the $f_n$ are suitable kernel density estimates, and $a(x,y)$ is a weight
function. Rosenblatt points out, however (ibid., Section 3), that the
asymptotic distributions of $T_n$ are generally normal, while those of $B_n$ are
normal only for the alternatives, and under independence $B_n$ has the less
familiar distribution of an infinite weighted sum of chi-squared variates.
Nevertheless, Rosenblatt indicates that tests based on the sample distribution
function are typically more powerful than those based on density estimates.
Incidentally, observe that $B_n$ is a rank test, while $T_n$ is not, although rank
versions of $T_n$ are easily constructed by replacing the observations by rank
functions prior to computing the densities. In any case, neither $B_n$ nor $T_n$
is asymptotically efficient for normal alternatives. In fact, $B_n$ is shown in
Table 1 as having an efficiency of 0.78.

From the discussion above, it emerges that no test of dependence or 
association appears to be known which is both consistent in general and 
asymptotically efficient in the normal case and that to obtain such a test 
is an open problem of some interest. Now a resolution of this problem is 
certainly possible along the lines of a two stage procedure in which the first 
stage consists of a consistent test sequence for bivariate normality with levels 
tending to zero at a suitable rate, and the second stage then uses either the
normal scores test, say, or the Hoeffding test, depending on the results at the
first stage. However, a less artificial resolution of the question, not involving
such multiplicity, would certainly be of special interest.

4. AN ECF TEST FOR INDEPENDENCE

The results of Section 2 suggest the study of ecf based procedures for
testing independence. To reestablish notation, recall our sample consists of
$n$ i.i.d. $p$-variate terms $(X_j^1, \ldots, X_j^p)'$, $j = 1, 2, \ldots, n$ from a distribution
having cf $c(t)$ where $t = (t^1, \ldots, t^p)'$. At issue is the mutual independence
of the terms in $(X_j^1, \ldots, X_j^p)'$. The ecf, as usual, is $c_n(t)$ and we introduce
now the marginal quantities

$$c^l(t^l) = c(0, \ldots, 0, t^l, 0, \ldots, 0) \tag{4.1}$$

and

$$c_n^l(t^l) = c_n(0, \ldots, 0, t^l, 0, \ldots, 0) = \frac{1}{n}\sum_{j=1}^{n} e^{it^l X_j^l} \tag{4.2}$$

for $l = 1, 2, \ldots, p$. Self-evident notations such as $c^{X}(s)$, $c^{XY}(s,t)$ and so on
may also be used. The underlying null hypothesis of independence may be
expressed as

$$c(t) = \prod_{l=1}^{p} c^l(t^l).$$

Now the central quantity in ecf based tests for independence is

$$\Gamma(t) = c(t) - \prod_{l=1}^{p} c^l(t^l) \tag{4.3}$$

and its asymptotically normal empirical counterpart

$$\Gamma_n(t) = c_n(t) - \prod_{l=1}^{p} c_n^l(t^l). \tag{4.4}$$

We know from Theorem 2.1 that $\Gamma_n(t) \to \Gamma(t)$ a.s. for all $t \in R^p$ (and indeed
uniformly over compact sets) and that the independence hypothesis may be
expressed $H_0 : \Gamma(t) = 0$, $t \in R^p$. Note, however, that it is not the case
that $E\,\Gamma_n(t) = \Gamma(t)$. For example, in the two-dimensional case ($p = 2$) we
have

$$E[c_n^1(t^1)c_n^2(t^2)] = E\left[\left(\frac{1}{n}\sum_{j_1=1}^{n} e^{it^1X_{j_1}^1}\right)\left(\frac{1}{n}\sum_{j_2=1}^{n} e^{it^2X_{j_2}^2}\right)\right] = \frac{1}{n}c(t^1,t^2) + \frac{n-1}{n}c^1(t^1)c^2(t^2) \tag{4.5}$$

so that

$$E[\Gamma_n(t^1,t^2)] = \frac{n-1}{n}\Gamma(t^1,t^2). \tag{4.6}$$

Thus, our null hypothesis can be expressed here as $H_0 : E\,\Gamma_n = 0$, but this
is the case only for $p = 2$.

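To express the general result we define the following quantities:

$$V_s = \sum \prod_{a=1}^{s} c^{i_{a1} i_{a2} \cdots i_{ap_a}}(t^{i_{a1}}, t^{i_{a2}}, \ldots, t^{i_{ap_a}}), \quad 1 \le s \le p, \tag{4.7}$$

where the sum is over all partitions of $(1, 2, \ldots, p)$ into $s$ sets
$(i_{11}, i_{12}, \ldots, i_{1p_1})$, $(i_{21}, i_{22}, \ldots, i_{2p_2})$, $\ldots$, $(i_{s1}, i_{s2}, \ldots, i_{sp_s})$.
Note that $V_1 = c(t)$. We have:

Theorem 4.1. For arbitrary $p > 1$,

$$E\,\Gamma_n(t) = c(t) - \sum_{s=1}^{p} \frac{n(n-1)\cdots(n-s+1)}{n^p}\, V_s \tag{4.8}$$

$$= \Gamma(t) + O\!\left(\frac{1}{n}\right), \tag{4.9}$$

where the order term is uniform in $t$. Under $H_0$, $E\,\Gamma_n(t) = 0$ for all $t$, $n$ and
conversely.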



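Proof. (4.8) follows upon examination of the terms arising in

$$E\left(c_n^1(t^1) \cdots c_n^p(t^p)\right) = \frac{1}{n^p}\, E \sum_{j_1=1}^{n} \cdots \sum_{j_p=1}^{n} \exp\left\{ i \sum_{l=1}^{p} t^l X_{j_l}^l \right\} \tag{4.10}$$

and, noting that $V_p = c^1(t^1) \cdots c^p(t^p)$, (4.9) follows from (4.8). The last
assertion now follows easily.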
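
The last assertion of Theorem 4.1 can be checked by simple counting. Under independence every product appearing in $V_s$ reduces to $\prod_l c^l(t^l)$, and $V_s$ contains one term per partition of $(1, \ldots, p)$ into $s$ sets, so the weights in (4.8) must add up to one. The sketch below (Python, exact rational arithmetic; the function names are ours and purely illustrative) verifies this identity.

```python
from fractions import Fraction


def stirling2(p, s):
    """Number of partitions of a p-element set into s non-empty sets."""
    if p == s:
        return 1
    if s == 0 or s > p:
        return 0
    return s * stirling2(p - 1, s) + stirling2(p - 1, s - 1)


def falling_factorial(n, s):
    """n (n - 1) ... (n - s + 1)."""
    out = 1
    for k in range(s):
        out *= n - k
    return out


def weight(n, p, s):
    """Coefficient n(n-1)...(n-s+1) / n^p multiplying V_s in (4.8)."""
    return Fraction(falling_factorial(n, s), n ** p)


def total_weight(n, p):
    """Sum over s of (number of terms in V_s) times the weight of V_s."""
    return sum(stirling2(p, s) * weight(n, p, s) for s in range(1, p + 1))
```

For example, `total_weight(7, 4)` returns exactly 1, so that under $H_0$ the right-hand side of (4.8) is $c(t) - \prod_l c^l(t^l) = 0$ for every $n$.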

In the cases $p = 3$ and $p = 4$, for example, the theorem gives, in obvious
notation,

$$E[\Gamma_n(t^1,t^2,t^3)] = c^{123} - \frac{n}{n^3}c^{123} - \frac{n(n-1)}{n^3}\left[c^{12}c^3 + c^{13}c^2 + c^{23}c^1\right] - \frac{n(n-1)(n-2)}{n^3}c^1c^2c^3, \tag{4.11}$$

$$E[\Gamma_n(t^1,t^2,t^3,t^4)] = c^{1234} - \frac{n}{n^4}c^{1234} - \frac{n(n-1)}{n^4}\left[c^{123}c^4 + c^{124}c^3 + c^{134}c^2 + c^{234}c^1 + c^{12}c^{34} + c^{13}c^{24} + c^{14}c^{23}\right]$$
$$- \frac{n(n-1)(n-2)}{n^4}\left[c^{12}c^3c^4 + c^{13}c^2c^4 + c^{14}c^2c^3 + c^{23}c^1c^4 + c^{24}c^1c^3 + c^{34}c^1c^2\right] - \frac{n(n-1)(n-2)(n-3)}{n^4}c^1c^2c^3c^4. \tag{4.12}$$

To construct our test statistics for $H_0$ based on $\Gamma_n(t)$ we require the
covariance structure of $\Gamma_n(t)$ under $H_0$. The general result is as follows:

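Theorem 4.2. Under the null hypothesis $H_0$ of mutual independence the
covariance structure of

$$\Gamma_n(t) = c_n(t) - \prod_{l=1}^{p} c_n^l(t^l)$$

is determined by

$$\operatorname{cov}(\Gamma_n(s),\Gamma_n(t)) = \left(\frac{1}{n}-\frac{1}{n^p}\right)\prod_{l=1}^{p}c^l(s^l-t^l) + \left(\frac{n-1}{n}-\frac{(n-1)^p}{n^p}\right)\prod_{l=1}^{p}c^l(s^l)\overline{c^l(t^l)} - \sum_{q=1}^{p-1}\frac{(n-1)^{p-q}}{n^p}\,\omega_q, \tag{4.13}$$

$$\omega_q = \sum_{\mu}\ \prod_{l\in\mu}c^l(s^l-t^l)\cdot\prod_{l\notin\mu}c^l(s^l)\overline{c^l(t^l)}, \tag{4.14}$$

where $\mu$ ranges over the combinations of $q$ indices selected from
$(1, 2, \ldots, p)$. The asymptotic structure is given by

$$\lim_{n\to\infty} n\cdot\operatorname{cov}(\Gamma_n(s),\Gamma_n(t)) = \prod_{l=1}^{p}c^l(s^l-t^l) + (p-1)\prod_{l=1}^{p}c^l(s^l)\overline{c^l(t^l)} - \omega_1$$

$$= \left[\prod_{l=1}^{p}c^l(s^l-t^l) - \prod_{l=1}^{p}c^l(s^l)\overline{c^l(t^l)}\right] - \sum_{m=1}^{p}\left[c^m(s^m-t^m) - c^m(s^m)\overline{c^m(t^m)}\right]\prod_{l\ne m}c^l(s^l)\overline{c^l(t^l)}. \tag{4.15}$$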



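Proof. Under $H_0$, $E\,\Gamma_n(t) = 0$ so that

$$\operatorname{cov}(\Gamma_n(s),\Gamma_n(t)) = E\left(\Gamma_n(s)\overline{\Gamma_n(t)}\right)$$

$$= E\left[c_n(s)\overline{c_n(t)}\right] - E\left[c_n(s)\prod_{l=1}^{p}\overline{c_n^l(t^l)}\right] - E\left[\prod_{l=1}^{p}c_n^l(s^l)\,\overline{c_n(t)}\right] + E\left[\prod_{l=1}^{p}c_n^l(s^l)\prod_{l=1}^{p}\overline{c_n^l(t^l)}\right]. \tag{4.16}$$

We next replace the terms $c_n(\cdot)$ and $c_n^l(\cdot)$ by their defining summations and
find that the four terms of (4.16) consist respectively of $n^2$, $n^{p+1}$, $n^{p+1}$ and
$n^{2p}$ products of exponential terms. Upon further examination we find that
each of the $n^2 + 2n^{p+1} + n^{2p}$ expectations now arising must be of the
type

$$\prod_{l\in\mu}c^l(s^l-t^l)\cdot\prod_{l\notin\mu}c^l(s^l)\overline{c^l(t^l)}, \tag{4.17}$$

where $\mu$ is a subset of $q$ elements of $(1, 2, \ldots, p)$, $q = 0, 1, 2, \ldots, p$. We shall
refer to the $q$ terms in the first product as being linked and the $p - q$ terms
in the second product as being unlinked.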

We shall now collect up, for each term (4.17), the coefficients that are
contributed to it by the four expressions in (4.16). For the case where all
terms are linked we have

$$\frac{n}{n^2} - \frac{n}{n^{p+1}} - \frac{n}{n^{p+1}} + \frac{n^p}{n^{2p}} = \frac{1}{n} - \frac{1}{n^p}. \tag{4.18}$$

For the case where all terms are unlinked we have

$$\frac{n(n-1)}{n^2} - \frac{n(n-1)^p}{n^{p+1}} - \frac{n(n-1)^p}{n^{p+1}} + \frac{n^p(n-1)^p}{n^{2p}} = \frac{n-1}{n} - \frac{(n-1)^p}{n^p}. \tag{4.19}$$

In the remaining cases, the first term in (4.16) does not contribute. Each
term involving $q$ links, $0 < q < p$, will have the coefficient

$$-\frac{n(n-1)^{p-q}}{n^{p+1}} - \frac{n(n-1)^{p-q}}{n^{p+1}} + \frac{n^p(n-1)^{p-q}}{n^{2p}} = -\frac{(n-1)^{p-q}}{n^p}. \tag{4.20}$$

We have now established (4.13). The asymptotic expression (4.15) now
follows directly upon taking the limits indicated; note that of the terms
$\omega_q$ in (4.13) only $\omega_1$ survives asymptotically.

We remark that for dimension $p = 2$ Theorem 4.2 gives

$$\operatorname{cov}(\Gamma_n(s),\Gamma_n(t)) = \frac{n-1}{n^2}\left[c^1(s^1-t^1) - c^1(s^1)\overline{c^1(t^1)}\right]\left[c^2(s^2-t^2) - c^2(s^2)\overline{c^2(t^2)}\right] \tag{4.21}$$

but this attractive form is not preserved for higher dimensions. In any case,
however, the value of $\operatorname{cov}(\Gamma_n(s), \Gamma_n(t))$ depends only upon the quantities
$c^l(s^l - t^l)$, $c^l(s^l)$, $c^l(t^l)$, $l = 1, 2, \ldots, p$ and is thus readily determined. In
the last of the two forms for the asymptotic expression (4.15) we may note
that the quantities in square brackets are the covariances of the ecf itself
in the $p$-variate case with the univariate cases. This is due to the fact that
the asymptotic expression may be derived using a differential argument by
expanding out

$$\Gamma_n(t) = \left(c(t) + [c_n(t) - c(t)]\right) - \prod_{l=1}^{p}\left(c^l(t^l) + [c_n^l(t^l) - c^l(t^l)]\right), \tag{4.22}$$

and dispensing with products of the square bracket terms which are of
higher than first order. We remark also that the general covariance
structure, without the independence assumption, may be derived by the methods
of Theorem 4.2. Those results, which are somewhat more complicated, are
not quoted here. We observe as well that the process $\Gamma_n(t)$ will inherit
certain weak convergence characteristics from the ecf which we do not
pursue here. It is enough for us that for $k$ fixed points $t_1, \ldots, t_k$ we have
that $\Gamma_n(t_1), \Gamma_n(t_2), \ldots, \Gamma_n(t_k)$ is asymptotically Gaussian with mean and
covariance structures as given by Theorems 4.1 and 4.2.
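
The bookkeeping of linked and unlinked terms is easy to check numerically. The following sketch (Python; the function names and the arbitrary complex test values are ours) takes, for each coordinate $l$, a "linked" value standing for $c^l(s^l - t^l)$ and an "unlinked" value standing for $c^l(s^l)\overline{c^l(t^l)}$, assembles the covariance from the coefficients of (4.18) to (4.20) as we read them, and compares the result with the product form for $p = 2$ and with the limit of $n$ times the covariance.

```python
from itertools import combinations


def product(values):
    out = 1
    for v in values:
        out *= v
    return out


def cov_h0(n, linked, unlinked):
    """Covariance assembled from the all-linked, all-unlinked and q-link coefficients."""
    p = len(linked)
    total = (1 / n - 1 / n ** p) * product(linked)
    total += ((n - 1) / n - ((n - 1) / n) ** p) * product(unlinked)
    for q in range(1, p):
        # omega_q: sum over all choices mu of q linked coordinates.
        omega_q = sum(
            product(linked[l] if l in mu else unlinked[l] for l in range(p))
            for mu in combinations(range(p), q)
        )
        total -= (n - 1) ** (p - q) / n ** p * omega_q
    return total


def cov_h0_p2(n, linked, unlinked):
    """Product form available when p = 2."""
    return (n - 1) / n ** 2 * (linked[0] - unlinked[0]) * (linked[1] - unlinked[1])


def asymptotic(linked, unlinked):
    """Limit of n * cov: only the single-link term omega_1 survives."""
    p = len(linked)
    omega_1 = sum(
        linked[m] * product(unlinked[l] for l in range(p) if l != m)
        for m in range(p)
    )
    return product(linked) + (p - 1) * product(unlinked) - omega_1
```

With, say, `linked = [0.3+0.1j, 0.5-0.2j]` and `unlinked = [0.2+0.05j, 0.1-0.3j]`, the two functions `cov_h0` and `cov_h0_p2` agree to rounding error for any `n`.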

We may now describe our ecf based tests for independence as follows.
First we select $k$ fixed $p$-vectors $t_j$, $j = 1, \ldots, k$ such that the $2k$ points
$-t_k, \ldots, -t_1, t_1, \ldots, t_k$ are all distinct. Next we form the $2k$-dimensional
vector $\xi = (\Gamma_n(-t_k), \ldots, \Gamma_n(-t_1), \Gamma_n(t_1), \ldots, \Gamma_n(t_k))'$. According to
Theorem 4.1 we shall have $E\xi = 0$ under $H_0$, and otherwise $E\xi \ne 0$ for some
selection of the $t_j$. Now the complex variance structure $\Sigma_\xi$ for $\xi$ is known
under $H_0$ from Theorem 4.2 and indeed, we may use the asymptotic form
(4.15). Further, we may estimate $\Sigma_\xi$ as $\hat{\Sigma}_\xi$ by replacing all cf’s by their
respective ecf’s. Our test will then consist of rejecting $H_0$ for large values of
the test statistic $\xi^*\hat{\Sigma}_\xi^{-1}\xi$ (here * is conjugate transpose) which has, under
$H_0$, asymptotically a chi-squared distribution with $2k$ degrees of freedom. In
accordance with the usual multivariate theory, this test is asymptotically the
uniformly most powerful invariant test based upon the statistics $\Gamma_n(\pm t_j)$.
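
A minimal sketch of the procedure for the simplest case $p = 2$, $k = 1$ follows (Python, standard library only). It is an illustration under stated assumptions rather than the implementation used in this paper: the function names are ours, the covariance plug-in uses the product form that Theorem 4.2 gives for $p = 2$ with conjugation on the second argument, the $2 \times 2$ Hermitian matrix is inverted in closed form, and the two data sets (a Gaussian $X$ with $Y = X^2$, and an exact product grid) are toy choices.

```python
import cmath
import math
import random


def ecf(values, t):
    """Empirical characteristic function of a univariate sample at t."""
    return sum(cmath.exp(1j * t * v) for v in values) / len(values)


def gamma_n(xs, ys, s, t):
    """Joint ecf at (s, t) minus the product of the marginal ecf's."""
    n = len(xs)
    joint = sum(cmath.exp(1j * (s * x + t * y)) for x, y in zip(xs, ys)) / n
    return joint - ecf(xs, s) * ecf(ys, t)


def cov_h0(xs, ys, u, v):
    """Plug-in estimate of cov(Gamma_n(u), Gamma_n(v)) under H0 for p = 2."""
    n = len(xs)
    fx = ecf(xs, u[0] - v[0]) - ecf(xs, u[0]) * ecf(xs, v[0]).conjugate()
    fy = ecf(ys, u[1] - v[1]) - ecf(ys, u[1]) * ecf(ys, v[1]).conjugate()
    return (n - 1) / n ** 2 * fx * fy


def single_point_test(xs, ys, t):
    """Return (Q, p_value) for xi = (Gamma_n(-t), Gamma_n(t))'."""
    minus_t = (-t[0], -t[1])
    xi1, xi2 = gamma_n(xs, ys, *minus_t), gamma_n(xs, ys, *t)
    a = cov_h0(xs, ys, minus_t, minus_t).real
    d = cov_h0(xs, ys, t, t).real
    b = cov_h0(xs, ys, minus_t, t)
    det = a * d - abs(b) ** 2
    # xi* Sigma^{-1} xi with Sigma = [[a, b], [conj(b), d]].
    q = (d * abs(xi1) ** 2 + a * abs(xi2) ** 2
         - 2 * (xi1.conjugate() * b * xi2).real) / det
    return q, math.exp(-q / 2)  # upper tail of chi-squared with 2 d.f.


rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(500)]
q_dep, p_dep = single_point_test(xs, [x * x for x in xs], (1.0, 1.0))

# A full product grid has an ecf that factorizes exactly, so Gamma_n vanishes.
a_vals, b_vals = [0.1, 0.7, 1.3, 2.2], [-1.0, 0.4, 0.9]
grid_x = [a for a in a_vals for _ in b_vals]
grid_y = [b for _ in a_vals for b in b_vals]
q_grid, p_grid = single_point_test(grid_x, grid_y, (1.0, 1.0))
```

For larger $k$ the same plug-in covariance fills a $2k \times 2k$ Hermitian matrix, which would then be inverted numerically.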

We proceed now to demonstrate heuristically that the test procedure
which has been proposed above inherits certain asymptotic optimalities by
virtue of Theorem 2.4. Our sole concern for the moment is that Theorem
2.4 essentially assures that asymptotic efficiency can be arbitrarily closely
attained by restricting attention to a sufficiently dense collection $\{c_n(t_j)\}$
of the $p$-variate ecf, while the test proposed here is based on $\Gamma_n$ and not
on $c_n$. We thus seek to establish that this has not compromised potential
asymptotic optimalities.

Let us thus suppose that we wish to test $H_0$ using some collection
$c_n(t_j)$, $j = 1, \ldots, k$ of values of the $p$-variate ecf, together with their
asymptotic Gaussian distribution in a manner that will not sacrifice the asymptotic
information in these statistics. Now the vector of expectations $E\,c_n(t_j)$ takes
values in general in the complex space $C^k$ while $H_0$ states that this vector
lies in the appropriate “unit rank” subspace $M \subset C^k$ consistent with the
factorizability of $c(t)$ under independence. Thus, asymptotically we may
view this as a problem in generalized non-linear least squares where the
error structure may be regarded as being Gaussian with known covariance.
Strictly speaking, the spaces $C^k$ and $M$ should be further confined by
permitting only points consistent with Bochner’s characterization theorem, but
not doing so does not affect the argument because asymptotically we find
ourselves within permissible neighbourhoods. We would thus obtain in this
way asymptotically UMP invariant tests among those based on the $c_n(t_j)$.

Now let us compare these to our $\Gamma_n$-based tests. Note that $\Gamma_n$ involves
not only the $c_n(t_j)$ but also the marginal quantities $c_n^l(t_j^l)$, $l = 1, \ldots, p$. In
terms of the ecf the restricted model specifying $H_0$ may be written as

$$E\,c_n^l(t_j^l) = \alpha_j^l, \quad l = 1, \ldots, p;\ j = 1, \ldots, k,$$
$$E\,c_n(t_j) = \prod_{l=1}^{p}\alpha_j^l, \quad j = 1, \ldots, k, \tag{4.23}$$

while in the unrestricted case the expectations are essentially free. The
likelihood ratio test of $H_0$ here is the asymptotically UMP invariant test
based on the chosen points of the ecf. On the other hand the test which
was proposed based on the $\Gamma_n$ is seen to be the Wald’s test (see Wald,
1943; Rao, 1965, Section 6e.3) and so is asymptotically locally as powerful
as the likelihood ratio test. We caution, however, that while this appears
satisfying, it does not translate automatically to optimality for the general
nonparametric testing problem; further discussion appears in the section
following.

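Concerning the complex variable format $\xi^*\Sigma_\xi^{-1}\xi$ of the test we remark
that this is preferred to working with the real variables

$U = (\operatorname{Re}\Gamma_n(t_1), \ldots, \operatorname{Re}\Gamma_n(t_k))'$ and $V = (\operatorname{Im}\Gamma_n(t_1), \ldots, \operatorname{Im}\Gamma_n(t_k))'$,

since the form of $\Sigma_{U,V}$ (which is readily obtained from $\Sigma_\xi$) is less natural
than that of $\Sigma_\xi$. In any case the two tests will be identical:

$$\xi^*\Sigma_\xi^{-1}\xi = (U', V')\,\Sigma_{U,V}^{-1}\,(U', V')'. \tag{4.24}$$

To see this note that $(U', V')' = A\xi$ for some nonsingular $A$ and the two
expressions in (4.24) are each identical to $(A\xi)^*(A\Sigma_\xi A^*)^{-1}(A\xi)$.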

Finally we remark that while the grid $\{t_j\}$ is being regarded here and
throughout as fixed, finite as well as being fine and extended, in practice
we may wish to use only a moderate number of data-dependent gridpoints.
Both numerical experience (see Feuerverger and McDunnough, 1981b)
and heuristic arguments suggest no practical difficulties in this although a
rigorous demonstration is not within our scope here.



5. A MONTE CARLO STUDY AND FURTHER REMARKS 

The general method proposed in the previous section raises many 
questions — more than can be resolved here. In this section we undertake 
two specific investigations. The first is to carry out a numerical implemen- 
tation of the procedure and a modest Monte Carlo simulation to demonstrate 
the effectiveness of the method in a simple case where the dependence is of 
a non-monotone character. The second is to examine the asymptotic rela- 
tive efficiency of the method when the underlying family is bivariate normal. 
Some general remarks concerning applications and suggestions for further 
work are also made. 

Numerical implementation of the procedure involving the test statistic
$\xi^*\Sigma_\xi^{-1}\xi$ as described in the previous section was carried out in a FORTRAN
program developed on the University of Toronto IBM 360/370. The inversion
$\Sigma_\xi^{-1}$ is based on IMSL subroutine LEQ2C. (The program may be obtained
from the author by request.) Our Monte Carlo study was based on i.i.d.
samples $(X_i, Y_i)$, $i = 1, 2, \ldots, n$ from a uniform distribution on the unit circle in
$R^2$. The points at which the statistic was computed were chosen rather
arbitrarily as $t_1 = (1,1)$, $t_2 = (1,-1)$, $t_3 = (2,2)$, $t_4 = (2,-2)$ and the
resulting test statistic was computed and compared to the upper 5 percent
point of the chi-square distribution with 8 degrees of freedom. One hundred
trials were conducted at each of the sample sizes $n$ = 100, 200, 400. Table
2 shows the number of significant trials in each case. Our test evidently is
effective and consistent in the present case. For comparison purposes,
consider, for example, the Fisher-Yates test. By symmetry, the correlation of
the normal scores is estimating zero. Consequently the resulting test will
not even be consistent. Similar remarks apply to other rank tests designed
against monotone dependence. The Hoeffding test, of course, will be
consistent; however, we have not attempted any power comparisons here as the
question of how best to select the points $t_1, \ldots, t_k$ in practice is not fully
resolved. For a Monte Carlo study of related interest, see Koziol and Nemec
(1979).



Table 2. The proportion of trials (out of 100) significant at 5% for samples
of size n from the uniform distribution on the unit circle.

n                         100    200    400

proportion significant    .42    .97   1.00



Turning now to the question of efficiency for Gaussian samples we obtain 
mixed results. On the positive side we have the following: 



Theorem 5.1. In the case $k = 1$ where the test procedure is
based on a single point $t_1 = (s_1, s_2)$ and the underlying family is Gaussian,
we have (under suitable interpretation) asymptotic efficiency (relative to
the Pearson correlation test) arbitrarily close to unity as $t_1$ tends to (0,0)
without touching the axes.

To see this result, note that as $(s_1, s_2) \to (0,0)$ we have, on Taylor
expanding,

$$\Gamma_n(s_1,s_2) = \frac{1}{n}\sum_{j=1}^{n}e^{i(s_1X_j+s_2Y_j)} - \left(\frac{1}{n}\sum_{j=1}^{n}e^{is_1X_j}\right)\left(\frac{1}{n}\sum_{j=1}^{n}e^{is_2Y_j}\right)$$

$$= -s_1s_2\cdot\operatorname{cov}_{j=1}^{n}(X_j,Y_j) + o(s_1s_2), \tag{5.1}$$

so that the $\Gamma_n$ statistic becomes equivalent to the covariance. In this case
$\overline{\Gamma}_n$ and $\Gamma_n$ become identical within $o(s_1s_2)$ so that the $2\times 2$ matrix $\Sigma_\xi$
tends to singularity. The appropriate limiting adjustment is to use
generalised inversion, or equivalently (Feuerverger and Fraser, 1980) to reduce
the procedure to a single real statistic and chi-square distribution with one
degree of freedom.
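
The small-argument behaviour in (5.1) can be checked directly. In the sketch below (Python; the five data pairs and the function names are arbitrary illustrative choices of ours), the ratio of $\Gamma_n(s_1, s_2)$ to $-s_1 s_2$ is compared with the sample covariance computed with divisor $n$.

```python
import cmath


def gamma_n(xs, ys, s1, s2):
    """Joint ecf at (s1, s2) minus the product of the marginal ecf's."""
    n = len(xs)
    joint = sum(cmath.exp(1j * (s1 * x + s2 * y)) for x, y in zip(xs, ys)) / n
    cx = sum(cmath.exp(1j * s1 * x) for x in xs) / n
    cy = sum(cmath.exp(1j * s2 * y) for y in ys) / n
    return joint - cx * cy


def sample_cov(xs, ys):
    """Sample covariance with divisor n."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n


xs = [0.5, -1.2, 0.3, 2.0, -0.7]
ys = [1.0, -0.4, 0.8, 1.5, -1.1]
s1, s2 = 1e-3, 2e-3
ratio = gamma_n(xs, ys, s1, s2) / (-s1 * s2)
```

The real part of `ratio` is close to `sample_cov(xs, ys)` and its imaginary part is of the order of the arguments, in line with the remainder term in (5.1).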

On the other hand, if we use additional points ($k > 1$) of the ecf, then
in the case of Gaussian samples some loss of efficiency generally occurs. To
illustrate this point let $(X_j, Y_j)$, $j = 1, \ldots, n$ be an i.i.d. sample from the
bivariate normal distribution $N_2\left(\binom{0}{0}, \begin{pmatrix}1 & \rho\\ \rho & 1\end{pmatrix}\right)$. The null hypothesis here is
$H_0 : \rho = 0$. We shall first attempt to examine asymptotic relative efficiencies
defined as the limiting ratio of approximate Bahadur slopes as $\rho \to 0$. (See
Bahadur (1960) and Wieand (1976).) Here the optimal procedure is based
of course on $T_1 = \frac{1}{n}\sum_{j=1}^{n}X_jY_j$. Now as a “second point” we introduce,
say, $T_2(t) = \frac{1}{n}\sum_{j=1}^{n}\sin tX_j \sin tY_j$. The test procedure requires that we
compute the covariance matrix of $T_1$ and $T_2(t)$ under $H_0$. This may be
done using trigonometric identities, the formula for the cf of the Gaussian
distribution, and substitutions such as $x = \lim_{s\to 0}(\sin sx)/s$ and $x^2 =
\lim_{s\to 0} 2(1 - \cos sx)/s^2$. The result under $H_0$ is

$$\Sigma_0 = \begin{pmatrix} 1 & t^2e^{-t^2} \\ t^2e^{-t^2} & \frac{1}{4}(1-e^{-2t^2})^2 \end{pmatrix}. \tag{5.2}$$

Our test statistic is, in fact, $Q = n(T_1,T_2)\Sigma_0^{-1}(T_1,T_2)'$. When $\rho \ne 0$ we
have $Q = n\cdot q(\rho) + o_p(n)$, where $q(\rho) = (E_\rho T_1, E_\rho T_2)\Sigma_0^{-1}(E_\rho T_1, E_\rho T_2)'$, so
that using Lemma (2.4) of Gregory (1980) we may compute the approximate
Bahadur slope in this case:



$$c(\rho; T_1, T_2(t)) = \lim_{n\to\infty} -\frac{2}{n}\log P(Q > \text{observed value}) \tag{5.3}$$

where $P$ is the asymptotic distribution of $Q$ under $H_0$. Thus

$$c(\rho; T_1, T_2(t)) = \lim_{n\to\infty} -\frac{2}{n}\log P(\chi_2^2 > nq(\rho)) = q(\rho). \tag{5.4}$$

By a similar computation we find the approximate Bahadur slope for
the test based on $T_1$ alone (which is, in fact, the optimal test) to be

$$c(\rho; T_1) = \rho^2 \tag{5.5}$$

so that the ratio of approximate slopes is

$$e(\rho, t) = \frac{c(\rho; T_1, T_2(t))}{c(\rho; T_1)} = \frac{q(\rho)}{\rho^2}.$$
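Now inverting $\Sigma_0$ we find

$$q(\rho) = \frac{\frac{1}{4}(1-e^{-2t^2})^2\rho^2 - 2t^2e^{-t^2}\rho\,E_\rho T_2(t) + [E_\rho T_2(t)]^2}{\frac{1}{4}(1-e^{-2t^2})^2 - t^4e^{-2t^2}} \tag{5.6}$$

where, after some computations

$$E_\rho T_2(t) = \frac{1}{2}e^{-t^2}\left[e^{\rho t^2} - e^{-\rho t^2}\right]. \tag{5.7}$$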

The limiting $\rho \to 0$ Bahadur approximate efficiency may now be obtained
and is

$$e_0(t) = \lim_{\rho\to 0} e(\rho;t) = 1, \tag{5.8}$$

independent of $t$. Unfortunately this is not the complete picture. Firstly,
for $\rho \ne 0$ we have typically $e(\rho;t) > 1$ which is unsatisfactory. Secondly, the
condition III* of Wieand (1976) can be verified in this case so that $e_0(t)$ can
be interpreted as a Pitman efficiency; however, the Pitman concept appears
not to apply in comparing chi-square tests with unequal degrees of freedom
(see, for example, Gregory, 1980, Section 3). We thus have here a case where
the limiting ratio of approximate Bahadur slopes has no known meaning
and which further illustrates the rather tenuous relationship between the
exact and approximate Bahadur concepts (Bahadur, 1967). The comparison
between the tests is therefore not easily carried out theoretically but could
certainly be studied by Monte Carlo methods. In any case, the optimal test
(in the Gaussian case) is that which rejects for large $T_1$ whereas the “two
points” test rejects for large $(T_1, T_2)\Sigma_0^{-1}(T_1, T_2)'$ which is a full rank fixed
quadratic form in $(T_1, T_2)$ and clearly somewhat less than fully optimal;
the efficiency loss will typically be small, but clearly a rather complicated
function.
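
The behaviour of the ratio of approximate slopes is easily tabulated. The sketch below (Python) rests on assumptions that should be checked against the original displays: it takes $T_1 = n^{-1}\sum X_jY_j$ and $T_2(t) = n^{-1}\sum \sin tX_j \sin tY_j$, and uses the null covariance entries (scaled by $n$) that follow from the standard normal cf, namely $1$, $t^2e^{-t^2}$ and $\frac{1}{4}(1-e^{-2t^2})^2$, together with $E_\rho T_2(t) = e^{-t^2}\sinh(\rho t^2)$. The function names are ours.

```python
import math


def sigma0(t):
    """Null covariance entries of (T1, T2(t)), scaled by n: (var T1, cov, var T2)."""
    var_t2 = 0.25 * (1.0 - math.exp(-2.0 * t * t)) ** 2
    cov_12 = t * t * math.exp(-t * t)
    return 1.0, cov_12, var_t2


def mean_t2(rho, t):
    """E_rho T2(t) = (1/2) exp(-t^2) [exp(rho t^2) - exp(-rho t^2)]."""
    return math.exp(-t * t) * math.sinh(rho * t * t)


def q(rho, t):
    """Quadratic form (E T1, E T2) Sigma0^{-1} (E T1, E T2)'."""
    var_t1, cov_12, var_t2 = sigma0(t)
    m1, m2 = rho, mean_t2(rho, t)
    det = var_t1 * var_t2 - cov_12 ** 2
    return (var_t2 * m1 * m1 - 2.0 * cov_12 * m1 * m2 + var_t1 * m2 * m2) / det


def efficiency(rho, t):
    """Ratio of approximate Bahadur slopes q(rho) / rho^2."""
    return q(rho, t) / rho ** 2
```

Under these assumptions `efficiency(rho, t)` tends to one as `rho` tends to zero and exceeds one slightly for `rho` away from zero (about 1.0047 at `rho = 0.5`, `t = 1`), which agrees with the two observations made above.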

In addition to those raised in the analysis above, some further open
questions emerge that seem worthwhile. Firstly, if inference concerning general
dependence is to be centered on the statistic

$$\Gamma_n(s,t) = c_n^{XY}(s,t) - c_n^X(s)c_n^Y(t), \tag{5.9}$$

then we remark that it may be worthwhile to consider first replacing the $X_i$
and $Y_i$ by their normal scores (or van der Waerden scores) to help standardize
testing considerations. Note that $c_n^X(s)$ and $c_n^Y(t)$ will now be fixed so that
the $\Gamma_n$ covariance function must be adjusted; however, under $H_0$ we will
now have that $\Gamma_n$ is distribution free. We remark also that continuous type
test statistics such as

$$\iint |\Gamma_n(s,t)|^2 \, w(s,t) \, ds \, dt, \tag{5.10}$$

(or more general quadratic forms) yield fruitful procedures which are
consistent and therefore will be competitive to and may be compared with the
Hoeffding statistic. Finally, we remark that the multivariate independence
testing problem has a time series analogue. In fact, the testing for
independence of a stationary process may be used as a tool in fitting time series
models, an application which we plan to pursue in a subsequent work.
ARE ECONOMIC VARIABLES
REALLY INTEGRATED OF ORDER ONE? 

1. INTRODUCTION 

Many macro-variables have a fairly smooth appearance over long time 
periods. This smoothness can be translated into functions commonly con- 
sidered by time-series analysts as: 

(i) the series have a spectrum with a large peak at low frequencies, called 
the “typical spectral shape” (or TSS) by Granger (1966), 

(ii) the correlogram declines very slowly as lag length increases, and 

(iii) the differenced series will have a correlogram that is explainable using a 
stationary ARMA model. 

These properties will be called the TSS properties. Some economic series
may appear to need to be differenced more than once, but I will not discuss
that case in any detail.

It was pointed out by Box and Jenkins (1970) that a first-order-integrated
process (denoted I(1)) or more specifically the ARIMA(p, 1, q)
process has the TSS properties. Such a process produces a stationary ARMA
series after differencing. It has become common to equate the two facts and
conclude that many economic series are actually I(1). In this paper I will
raise the question posed in the title but will not reach a definitive answer,
as the necessary empirical work has not yet been undertaken. Rather, I
will note that there are some problems in matching economic reality with
the I(1) model and suggest some alternative models that circumvent these
difficulties. I also point out that there is a wide class of models that
produce series with the TSS properties, particularly once the possibility of a
deterministic trend is taken into account.
A formal definition of an I(1) series, denoted by $x_t$, is that it has the
property that

$$x_t - x_{t-1} = y_t, \tag{1.1}$$

where $y_t$ is stationary. An example is the simple random walk, where $y_t$ is
just a zero mean, white-noise series, denoted $\epsilon_t$, so that in this case

$$x_t - x_{t-1} = \epsilon_t. \tag{1.2}$$

To generate such a series, one merely needs the inputs $y_t$ plus a starting $x$
value.

A simple but interesting and instructive example is where

$$y_t = m + \epsilon_t,$$

where $\epsilon_t$ is white noise as above. Now (1.1) is a model generating a series
known as a random walk with drift, in the case when $m \ne 0$. If the process
starts at time $t = -N - 1$, with the starting value $x_{-N-1}$, then

$$x_t = (N + t)m + \sum_{j=-N}^{t}\epsilon_j + x_{-N-1}. \tag{1.3}$$

For convenience, it will be assumed that $x_{-N-1} = 0$. It is convenient to
rewrite (1.3) as

$$x_t = h(t) + s_t,$$

where

$$h(t) = (N + t)m$$

is the deterministic trend in mean, and

$$s_t = \sum_{j=-N}^{t}\epsilon_j$$

is the zero-mean, stochastic component of $x_t$, having variance

$$V(t) = (N + t)\sigma^2,$$

where $\sigma^2 = \operatorname{var}(\epsilon_t)$. For ease of exposition it will be assumed that the
variance of $\epsilon_t$ is constant, although in practice this does not necessarily
occur. I will generally take a white noise series merely to be an independent
sequence. It is frequently claimed that a random walk has infinite variance.
This will be achieved only if $m = 0$ and then letting $N \to \infty$. It should be
noted that this limiting procedure cannot occur if $m \ne 0$, as then the level
of the series would be at infinity. The importance in practical terms of there
being an actual starting time is clearly seen, and replacing $\epsilon_t$ by any
zero-mean stationary series makes no difference in this analysis. Thus, rather
than assume an infinite variance for a random walk, it is more realistic to
assume that both mean and variance of $x_t$ are linear functions of time. The
actual time at which the process starts is not particularly important; what
matters is that it started a finite time ago.

It should also be noted that

$$E[s_t s_{t-k}] = \sigma^2(N + t - k)$$

so that

$$\rho_k = \operatorname{corr}(x_t, x_{t-k}) = \frac{N + t - k}{[(N + t)(N + t - k)]^{1/2}} = \left[\frac{N + t - k}{N + t}\right]^{1/2},$$

which is nearly one if $N + t$ is large compared to $k$. It is thus seen that
an I(1) series, starting up a finite time ago, will have the TSS properties,
at least to a high degree of approximation.
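
To give a feeling for how slowly the correlogram declines, the correlation formula above can be evaluated directly. The short Python sketch below simply codes that formula; the function name `rho_k` and the numerical values are illustrative assumptions, not taken from any data.

```python
import math


def rho_k(N, t, k):
    """corr(x_t, x_{t-k}) = sqrt((N + t - k) / (N + t)), as in the text."""
    return math.sqrt((N + t - k) / (N + t))


# A series observed 100 periods after its start: autocorrelations at lags 1..20.
correlogram = [rho_k(90, 10, k) for k in range(1, 21)]
```

With $N + t = 100$ the correlation is still 0.9 at lag 19, and with $N + t = 1000$ it exceeds 0.999 at lag 1, so the correlogram of a finite-start random walk is very flat.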

The basic evidence that many economic macro-variables are I(1) is
provided by the many model identifications using the Box-Jenkins interpretation
of the correlogram. Actual series often have trends in mean and
autocorrelations that are large for many lags. As I(1) series also have these properties,
it is common practice to conclude that actual series are I(1), but as will be
shown this is not necessarily the correct conclusion. In the next section,
various other linear models having at least approximately the TSS properties
are discussed. In the following section, the fact that many economic series
are inherently positive will be faced and some non-linear models considered.
The paper finishes with a general discussion and conclusions.



2. AN INTRODUCTION TO TRENDING SERIES 

If $h_j$, $j = 0, 1, 2, \ldots$ is a sequence of positive constants, a series $x_t$ may
be generated by the application of a sequence of increasingly long filters
applied to the input series $e_t$, giving

$$x_t = \sum_{j=0}^{t} h_j e_{t-j}. \tag{2.1}$$

Taking

$$e_t = m + \epsilon_t,$$

where $\epsilon_t$ is i.i.d., with zero mean and variance $\sigma^2$, one gets the decomposition

$$x_t = mh(t) + s_t,$$

where

$$h(t) = \sum_{j=0}^{t} h_j$$

and

$$s_t = \sum_{j=0}^{t} h_j \epsilon_{t-j}$$

is a zero-mean stochastic component with variance

$$V(t) = \sigma^2\sum_{j=0}^{t} h_j^2.$$

The sequence $h_j$ is said to form a divergent series if $h(t)$ continually increases
without bound, in which case $mh(t)$ will provide a deterministic trend in
mean provided $m \ne 0$. Further, the sequence may generate a trend in
variance if $V(t)$ is an increasing, unbounded sequence. Clearly, the theory
of divergent series may be used to analyze these functions. As an example,
consider the sequence $h_j = 1/j$, then $h(t)$ tends to $\log t$ and $V(t)$ tends to
a constant. Thus, in this case, the filters generate a trend in mean, but not
in variance. For a second example, if $h_j = j^{\alpha}$ then $h(t)$ tends to $c_1t^{\alpha+1}$ and
$V(t)$ tends to $c_2t^{2\alpha+1}$, where $c_1, c_2$ are constants.
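
The first example can be checked numerically. The sketch below (Python) is only illustrative: the function name is ours, the sums are started at $j = 1$ because $1/j$ is undefined at $j = 0$, and $\sigma^2$ is set to one. It accumulates $h(t)$ and $V(t)$ for a given weight function.

```python
import math


def trend_and_variance(h, t, sigma2=1.0):
    """Return h(t) = sum of h_j and V(t) = sigma2 * sum of h_j^2 for j = 1..t."""
    weights = [h(j) for j in range(1, t + 1)]
    return sum(weights), sigma2 * sum(w * w for w in weights)


h_sum, v_sum = trend_and_variance(lambda j: 1.0 / j, 100000)
```

For $h_j = 1/j$ the trend $h(t)$ differs from $\log t$ by a quantity that settles at Euler's constant (about 0.5772), while $V(t)$ converges to $\pi^2/6$, so there is a trend in mean but none in variance.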

If $h_1(t)$, $h_2(t)$ are a pair of trending functions, then $h_1(t)$ is said to
dominate $h_2(t)$ if the ratio

$$\frac{h_2(t)}{h_1(t)} \to \text{zero as } t \to \infty.$$

Thus $h(t)$ is a trend if it dominates any constant. A number of features of the
trend generating filters are considered by Granger (1985), which concentrates
on the dominant components of any trend; it is shown that a wide variety of
trends can be generated, the effects of trend-reducing filters are determined
and the corresponding spectral shapes at low frequencies are discussed.

A particularly interesting class of filters arises from considering functions
$g(B)$ of the lag operator $B$ such that $g(1) = 0$. Let

$$\frac{1}{g(B)} = \sum_{j=0}^{\infty} g_j B^j$$

and now take $h_j = g_j$, $j = 0, 1, \ldots, t$. The resulting filters generate trends
and the stochastic part $s_t$ may be called generalized integrated. Examples
include the fractional integrated processes with $g(B) = (1 - B)^{\alpha}$ where $\alpha$ may
be a fraction, as considered by Granger and Joyeux (1980), Hosking (1981)
and others. If $x_t$ is I(d), d a fraction, then the trend in mean produced is
dominated by $ct^d$ for some constant $c$.
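
For the fractional case the filter weights are available from a simple recursion. The sketch below (Python) assumes the filter is the expansion of $(1 - B)^{-d}$, for which $g_0 = 1$ and $g_j = g_{j-1}(j - 1 + d)/j$; the function name and the parameter name `d` are ours.

```python
def inverse_filter_weights(d, t):
    """Coefficients g_0, ..., g_t in (1 - B)^(-d) = sum_j g_j B^j."""
    g = [1.0]
    for j in range(1, t + 1):
        g.append(g[-1] * (j - 1 + d) / j)
    return g


unit_root_weights = inverse_filter_weights(1.0, 5)    # ordinary integration
fractional_weights = inverse_filter_weights(0.5, 3)   # d = 1/2
```

With $d = 1$ every weight equals one, which reproduces the random walk filter of Section 1, while for fractional $d$ between 0 and 1 the weights decline slowly and their partial sums still diverge.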

Many of the series generated by the filters discussed in this section will
have the TSS properties, at least approximately. The series will usually have
trends in mean and variance, but the differenced series are

$$x_t - x_{t-1} = e_t + \sum_{j=1}^{t}(h_j - h_{j-1})e_{t-j}$$

and if $\Delta h_j$ is a convergent series, then this differenced series may
appear to be stationary. It is also proved by Granger (1985) that, provided
$h_t/h_{t-k} \to 1$ as $t \to \infty$ for $k$ fixed and $V(t)$ is a trend,
$\operatorname{corr}(s_t, s_{t-k}) \to 1$ as $t$ increases. This property will hold for all the
generalized integrated series with $d > 1/2$ as well as filters that generate
explosive trends. This point can be put another way, as most series produced
by a generalized integrated model will have a plot of $x_t$ against $x_{t-1}$ that lies
tightly along the 45° line. It is an empirical fact that many macro-economic
series have this property.

It is thus seen that many series can be generated, including some
fractional integrated series, which have the TSS properties, at least
approximately. It follows that it is not necessarily correct to identify a series as I(1)
if it appears to have the TSS properties.



3. THE REAL ECONOMY 

Many economic series are intrinsically positive, such as production,
imports, prices, wages and employment. Most of these also have a distinctive
upward trend and thus can be well modelled as either an I(1) process
or by one of the generalized integrated processes discussed in the previous
section. This follows because for all trends of the form $h(t) = ct^{\alpha}$,
$h(t)/\sqrt{V(t)}$ will itself be an increasing trend and the central limit theorem
will also apply, suggesting that for large enough $t$, a negative value becomes
extremely unlikely.

However, some economic series are also necessarily positive but have no
clear-cut trends. The best examples are interest rates, exchange rates over
long periods, and some real prices, including stock market indices measured
daily and weekly. Such positive series are often found to have the TSS
properties but not to have any distinct trend. This combination of properties
makes an I(1) model, without drift, seem very inappropriate.

Any I(1) process without trend, such as a simple random walk, will
wander widely, will cross all levels rarely but occasionally and will certainly take
negative values. For many of the economic variables just mentioned, these
are not realistic properties. Not only are they confined to positive values,
but there may be control mechanisms that prevent wandering into high or
unusual values, such as profit taking in the stock market and actions by
central banks and governments for exchange rates and interest rates. Thus,
we need to consider different classes of models to represent the properties of
such real data.



4. MODELLING APPROXIMATE INTEGRATED PROCESSES 

It is convenient to start by considering the simple random walk without 
drift 

+ et, (3.1) 

where e* is a white noise (mutually uncorrelated) series with mean zero. 
The problem with this model, if xt is known to be positive, is that if xt-i is 
small, a negative value of e* may take Xt into the unallowed negative region. 
Before considering more exotic models, it is worth first asking if this simple 
linear model can be changed to overcome the problem of generating negative 
values. There does exist a possibly relevant class of models of the form 

    $x_t = \rho x_{t-1} + \delta_t e_t$,    (3.2)

where $\delta_t$, $e_t$ are an independent pair of white noise processes with

    $\delta_t = 0$ with probability $\rho$
             $= 1$ with probability $1 - \rho$

and $e_t$ is an i.i.d. sequence of positive random variables. Lawrance and Lewis
(1980), for example, have considered the case where the $e_t$ are exponentially
distributed and called the process exponential autoregressive (EAR). These
models can provide series with a given marginal distribution (exponential)
and a required (AR(1)) autocorrelation sequence, but the joint distribution
is often somewhat less attractive in that the series generated have a different
appearance than actual economic data. This problem is clearer as $\rho$ becomes
larger, as then the series consists of exponentially declining sections
$\rho^k x_{t-k}$ with occasional random positive jumps. In particular, if one
takes $\rho$ near one the process will not look like an $I(1)$ sequence and the
limit as $\rho$ tends to one becomes quite inappropriate. Thus, these models do
not look promising for the purpose of this paper.
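
The behaviour described for (3.2) can be illustrated with a small simulation. The sketch below is not taken from the paper: it assumes unit-rate exponential innovations, $\rho = 0.5$, and hypothetical names (`simulate_ear1`, `rate`, `seed`). It only shows that the generated series stays positive and that its sample mean is close to the mean of the exponential marginal.

```python
import random


def simulate_ear1(n, rho=0.5, rate=1.0, seed=0):
    """Simulate x_t = rho * x_{t-1} + delta_t * e_t as in (3.2).

    delta_t is 0 with probability rho and 1 otherwise; e_t is exponential.
    The parameter values are illustrative assumptions, not from the paper.
    """
    rng = random.Random(seed)
    x = [rng.expovariate(rate)]  # start from the exponential marginal
    for _ in range(n - 1):
        delta = 0 if rng.random() < rho else 1
        shock = rng.expovariate(rate)  # positive i.i.d. innovation
        x.append(rho * x[-1] + delta * shock)
    return x


series = simulate_ear1(20000, rho=0.5)
sample_mean = sum(series) / len(series)
smallest = min(series)
```

Raising `rho` towards one in this sketch lengthens the decaying stretches between jumps, which is the feature the text regards as unlike actual economic data.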

An alternative approach is to change the distribution of $e_t$ depending on
the value of $x_{t-1}$. For example, $e_t$ may be drawn from a unimodal, symmetric
distribution (bounded below) for “usual” values of $x_{t-1}$, from a unimodal
distribution with most weight below the mode if $x_{t-1}$ is large and the reverse
distribution shape when $x_{t-1}$ is small. However, if $x_{t-1}$ gets very small,
that is near zero, the distribution would have to have all its weight above zero
and thus the residual cannot have zero mean at this extreme. It is unclear if
a sequence of residuals constructed in such a way would be white noise and
they certainly could not have zero mean for all possible $x_{t-1}$ values. The
result would be a non-linear AR(1) process with heteroscedasticity.

One natural extension of the usual linear model is to consider a linear 
model with time-varying parameters, such as 



    $x_t = \alpha_t x_{t-1} + e_t$.    (3.3)

A series generated in this way might look similar to an $I(1)$ process if the
average value of $\alpha_t$ is near one. However, if $\alpha_t$ varies
independently of $x_{t-1}$ the range of $x_t$ will still be extensive, and $x_t$
can go negative. If one makes $\alpha_t$ a function of $x_{t-1}$, the resulting
model is just a non-linear model of the form

    $x_t = f(x_{t-1}) + e_t$.    (3.4)

These considerations suggest that the most promising model is of the form
(3.4) when the function $f(\cdot)$ is chosen to approximate an $I(1)$ process
and the distribution of $e_t$ is chosen to ensure that $x_t > 0$.

A simple example is a pure random walk with reflecting barriers that 
hardly ever come into operation. A sample taken from a process generated 
in this way will certainly have TSS. One can soften the barriers and get the 
same result. For example, suppose that $f(x)$ is virtually equal to $x$ for a
wide range of values but with $f(x) < x$ for $x$ sufficiently large. This will
ensure that the process has TSS but has finite variance. An example is

    $x_{t+1} = x_t \exp(-x_t^2/m^2) + e_{t+1}$.    (3.5)

Again, data generated by this equation, with the variance of $e_t$ small, will
form a series that has TSS, as indicated by some experiments mentioned by
Granger et al. (1984). Models such as (3.5) are then called “almost integrated”
or AINT models.
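
A rough simulation can show how a soft barrier of this kind behaves. The sketch below is an illustration under stated assumptions rather than the experiment reported by Granger et al.: it takes $f(x) = x \exp(-x^2/m^2)$ as one reading of (3.5), uses Gaussian shocks (which do not by themselves keep the series positive), and the names `simulate_aint` and `lag1_autocorr` and the values `m = 50`, `sigma = 1` are hypothetical.

```python
import math
import random


def simulate_aint(n, m=50.0, sigma=1.0, x0=10.0, seed=0):
    """Iterate x_{t+1} = x_t * exp(-(x_t/m)^2) + e_{t+1} with Gaussian e_t.

    The squared exponent and the Gaussian shocks are assumptions made for
    this sketch; the paper requires a shock distribution that keeps x_t > 0.
    """
    rng = random.Random(seed)
    x = [x0]
    for _ in range(n - 1):
        prev = x[-1]
        x.append(prev * math.exp(-((prev / m) ** 2)) + rng.gauss(0.0, sigma))
    return x


def lag1_autocorr(series):
    """Sample first-order autocorrelation of a series."""
    n = len(series)
    mean = sum(series) / n
    num = sum((series[i] - mean) * (series[i - 1] - mean) for i in range(1, n))
    den = sum((v - mean) ** 2 for v in series)
    return num / den


aint = simulate_aint(5000)
largest_excursion = max(abs(v) for v in aint)
rho1 = lag1_autocorr(aint)
```

In this sketch the series remains highly autocorrelated, as an integrated series would, while its excursions stay well inside the barrier scale `m`.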

The function $f(x)$ in (3.4) can be estimated using a non-parametric technique
similar to spline curve fitting, as investigated by Engle et al. (1984).
When used with some U.S. interest rate data, an indication of the existence of
a curve like that in (3.5) was found, but not completely convincingly. No
similar evidence was found for exchange rate or balance of trade data, where
the plots of $x_t$ against $x_{t-1}$ produced almost perfect 45° lines.

It is interesting to note that even bounded deterministic processes can 
have a property similar to TSS. It is noted by Aizawa and Kohyama (1984) 
that a process generated by 



    $x_{t+1} = f(x_t)$,

where

    $f(x) = x + 2^{B-1}(1 - 2\epsilon)x^{B} + \epsilon$,    $0 \le x \le 1/2$
         $= x - 2^{B-1}(1 - 2\epsilon)(1 - x)^{B} - \epsilon$,    $1/2 < x \le 1$,

where $\epsilon > 0$ is a small, fixed perturbation and $B > 3/2$, gives a
process $x_t$ that is bounded by

    $0 < x_t < 1$

and having a spectrum

    $f(w) \sim w^{-k}$,

where

    $k = 3 - B/(B - 1)$

and with 

w> Wc = 0(c^/^). 

Thus, the spectrum will have the “typical shape” down to a low frequency
but is bounded at yet lower frequencies. If $\epsilon$ is small, but positive,
the resulting spectrum will be very difficult to differentiate from that of a
fractionally differenced process with $d = k/2$, and thus with $0 < d < 1$. As
$x_t$ generated in this fashion is bounded, it will necessarily have finite
variance.
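
The boundedness of this deterministic map is easy to check numerically. The following sketch iterates the map under illustrative values $B = 1.8$ and $\epsilon = 10^{-3}$; the function names and the clipping step (a guard against floating-point overshoot at the branch boundary) are assumptions of the sketch, not part of the original description.

```python
def modified_bernoulli(x, B=1.8, eps=1e-3):
    """One step of the piecewise map f(x) on [0, 1] described in the text."""
    scale = 2.0 ** (B - 1.0) * (1.0 - 2.0 * eps)
    if x <= 0.5:
        y = x + scale * x ** B + eps
    else:
        y = x - scale * (1.0 - x) ** B - eps
    # Numerical guard only: keeps rounding error from leaving [0, 1].
    return min(1.0, max(0.0, y))


def orbit(x0, n, B=1.8, eps=1e-3):
    """Generate n points of the deterministic sequence x_{t+1} = f(x_t)."""
    xs = [x0]
    for _ in range(n - 1):
        xs.append(modified_bernoulli(xs[-1], B, eps))
    return xs


xs = orbit(0.3, 10000)
```

The two branches meet the unit interval at its ends: $f(0) = \epsilon$ and $f(1/2) = 1$, so orbits linger near 0 and near 1 for long stretches, which is what produces the low-frequency power.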

A class of generating mechanisms that is potentially relevant for produc- 
ing positive series with TSS is to take 

    $x_t = g(y_t)$,

where $y_t$ is $I(d)$, say, and $g(y) > 0$ for all $y$. A simple example would be

    $x_t = y_t^2$.



It is clear that if $y_t$ contains a trend then so will $x_t$, provided $g(y)$
is not bounded above. A complete theory for this class of models does not seem
to be available, although Surgailis (1981) has results for the case $d < 1/2$,
so that $y_t$ has finite variance.

Considering just the case where $x_t$ is $y_t^2$, for $x_t$ to not have a trend,
it will be necessary that $y_t$ has no trend in mean and no trend in variance.
For example, if $y_t$ is a random walk without drift, starting $t$ periods
earlier, then $y_t$ will have variance proportional to $t$, so that $x_t$ will
have a mean proportional to $t$, and so will not be trend-free in mean. It thus seems
that instantaneous transforms of generalized integrated series are unlikely 
to provide satisfactory models for trend free TSS series. They will often lack 
economic plausibility, as an accepted interpretation of the core or driving 
series yt is usually difficult to provide. 
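
The claim that the square of a driftless random walk has a mean growing in proportion to $t$ can be checked by Monte Carlo. The sketch below assumes unit-variance Gaussian steps and a start at zero; the function name, the number of paths and the horizons are hypothetical choices.

```python
import random


def mean_square_of_random_walk(horizons, n_paths=2000, seed=0):
    """Estimate E[y_t^2] at the given horizons for y_t = y_{t-1} + e_t, y_0 = 0."""
    rng = random.Random(seed)
    horizon_set = set(horizons)
    totals = {t: 0.0 for t in horizons}
    for _ in range(n_paths):
        y = 0.0
        for t in range(1, max(horizons) + 1):
            y += rng.gauss(0.0, 1.0)  # unit-variance step (an assumption)
            if t in horizon_set:
                totals[t] += y * y
    return {t: totals[t] / n_paths for t in horizons}


means = mean_square_of_random_walk([50, 100, 200])
```

With unit-variance steps the estimate at each horizon should be close to the horizon itself, so $x_t = y_t^2$ has a linear trend in mean.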



5. CONCLUSIONS. WHERE DOES TSS COME FROM? 

It is a well-established empirical fact that many macro-economic vari- 
ables display the properties here denoted by TSS. The question considered 
in the majority of this paper is what models generate series with such a
property, and this class is found to be wider than the $I(1)$ models considered
by Box and Jenkins (1970). For trending series, there are many filters
which generate trends and corresponding stochastic components, many of 
which have TSS. At this time, the analysis of trends is one of the least 
developed parts of time-series analysis and econometrics, although it is po- 
tentially a very important part of a complete analysis of economic data. For 
non-trending, positive series with TSS, the AINT, non-linear models appear 
to have more potential than many of the more obvious alternatives. 

One very important question remains: why do we observe economic series
having the TSS property? There appears to have been surprisingly little
discussion of this question. For one group of variables, prices of speculative
goods such as stocks, bonds, currencies, gold, silver and other commodities,
the efficient market theory provides an immediate answer. Such speculative
prices should follow a random walk (possibly with a small drift) because
otherwise it would be possible to have an investment strategy that guarantees
a positive return. If such a “money-pump” strategy existed, it would
naturally be used by all investors and it follows that it cannot exist, at least
for time periods up to the decision horizon of most investors. Thus, these
speculative prices should be $I(1)$ except possibly in the very long run,
beyond speculation investment horizons, when a model such as AINT could
apply. Of course, the vast majority of the empirical evidence supports this
observation.

For most macro-economic series with TSS, speculation is not an important
component. As I have noted elsewhere (Granger, 1984), most of these
variables are aggregates over huge numbers of micro-decision makers, such
as consumers, families or companies. For example, U.S. consumption of
non-durables is the aggregate of such consumption of over 80 million families.
If one considers a single family, it is unlikely that this consumption will
be a pure $I(1)$ series as consumption must be positive, may not be smooth
because of changes in employment status or in income, and will not range
widely from current levels. Because of income or borrowing constraints a
non-linear model, such as AINT, may again appear to be appropriate for
univariate series modelling. Of course, a better model will be one linking
the consumption and income series for each family. If all of the individual
family consumption series are independent of each other, it has been noted
before (Granger, 1980) that each can obey a simple model, such as a
stationary AR(1), yet the aggregate can be $I(d)$, $d > 0$, and thus have
the TSS property. In particular, if $c_{jt}$ is the consumption series for the
$j$th family and it obeys the AR(1) model

    $c_{jt} = m_j + \alpha_j c_{j,t-1} + e_{jt}$,

where $e_{jt}$ is white noise, and the $\alpha_j$ are drawn from the beta
distribution on (0, 1) (with density 0 elsewhere), with $p > 0$, $q > 0$,
then the aggregate consumption is shown to be $I(1 - q/2)$.

However, the assumption that the consumption series are independent is
quite unacceptable and if a rather strong type of dependence is introduced,
coming from a common factor, a quite different source of the TSS property
becomes important. For example, suppose that 

^jt — 

where all pairs of series are independent of each other and also of the
common factor $c_t$. Then the aggregate consumption

3 



will have two components, 

Cyt with variance proportional to N 

3 
and

with variance proportional to $N^2$,

3 

provided where N is the number of families in the aggregate. Even 

if the common factor component of each family consumption contributes very 
little to the variance of an individual Cjt series it will completely dominate the 
aggregate series if N is very large. Thus, if the families use common factors 
in their consumption decisions, such as interest rates, speculative price, or 
aggregate income (the common factor of family incomes) or even inflation 
expectations and if any of these common factors have the TSS property this 
will be sufficient to make observable aggregate series have this property. I 
think that this possibility deserves empirical investigation. 
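
The arithmetic behind this dominance can be made concrete. The sketch below assumes the simplest case, in which every family loads on the common factor with the same unit weight, so that the common part of the aggregate has variance $N^2 \sigma_c^2$ and the independent parts have total variance $N \sigma_e^2$. The function name and the variance values are hypothetical.

```python
def common_factor_share(n_families, var_common, var_idio=1.0):
    """Share of aggregate variance due to the common factor.

    Assumes equal unit loadings on the common factor and mutually
    independent family-specific components (a simplifying assumption).
    """
    common = n_families ** 2 * var_common  # grows like N^2
    idio = n_families * var_idio           # grows like N
    return common / (common + idio)


share_one_family = common_factor_share(1, 0.01)
share_aggregate = common_factor_share(80_000_000, 0.01)
```

A factor that accounts for about one per cent of a single family's variance accounts for essentially all of the variance of an aggregate over 80 million families.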
FRACTIONAL MATRIX CALCULUS AND THE
DISTRIBUTION OF MULTIVARIATE TESTS 

ABSTRACT 

Fractional matrix operator methods are introduced as a new tool of dis- 
tribution theory for use in multivariate analysis and econometrics. Earlier 
work by the author on this operational calculus is reviewed and to illus- 
trate the use of these methods we give an exact distribution theory for a 
general class of tests in the multivariate linear model. This distribution 
theory unifies and generalizes previously known results, including those for 
the standard F statistic in linear regression, for Hotelling’s $T^2$ test and
for Hotelling’s generalized $T_0^2$ test. We also provide a simple and novel
derivation of conventional asymptotic theory as a specialization of exact theory.
This approach is extended to generate general formulae for higher order 
asymptotic expansions. Thus, the results of the paper provide a meaningful 
unification of conventional asymptotics, higher order asymptotic expansions 
and exact finite sample distribution theory in this context. 



1. INTRODUCTION 

The purpose of this paper is to provide a short review of some new 
methods I have been working with recently in the field of econometric distri- 
bution theory. These methods have turned out to be surprisingly useful in 
furnishing solutions of a rather general nature to a wide range of problems 
that occur in finite sample econometrics. Since these problems are very
similar to those that arise naturally in other areas of statistical theory, notably
multivariate analysis, I hope that the methods I have been developing will 
be of some interest to mathematical statisticians who are working in these 
related fields. 
The methods rely on the concept of matrix fractional differentiation
and therefore belong to an operational calculus. At an abstract level the 
techniques may be interpreted within the framework of pseudo-differential 
operators on which there is a large mathematical literature (see, for example, 
Treves, 1980). At the algebraic and purely manipulative level it is hard to 
find any references in the literature beyond those which apply to scalar 
methods of fractional calculus. Even here most attention is concentrated
on the Riemann-Liouville definition of a fractional integral (or derivative),
whereas in applications to statistical distribution theory I have found that
a form of Weyl calculus yields the simplest and most direct results. It is also
the most amenable to matrix generalizations. For an introduction to scalar 
fractional operators of this type the reader is referred to the books by Ross 
(1974a), Spanier and Oldham (1974) and the review article by Lavoie et al. 
(1976). 

The use of an operational calculus in problems of distribution theory has 
many natural advantages. In the first place, seemingly difficult problems may 
often be solved quite simply with rather elegant general solution formulae. 
The latter usually avoid the complications of series representations, including 
those that are expressed in terms of zonal or invariant polynomials which 
many researchers find daunting and difficult for numerical work. Second, the 
routine manipulation of operators frequently leads to simplifications which 
are not otherwise obvious. Both these advantages arise, of course, in other 
applications of operator methods. However, I have discovered that there are 
some advantages to operational methods which are peculiar to their use in 
statistical distribution theory. 

Perhaps the most important of these is that the methods provide a sim- 
ple means of unifying limiting distribution theory, asymptotic expansions 
and exact distribution theory. This is because the operator representation 
of the exact finite sample distribution often lends itself to the immediate 
derivation of the asymptotic distribution and associated expansions about
the asymptotic distribution. Thus, all three forms of distribution theory 
may often be derived from the same general formulae. An example will be 
studied later in the paper. 

A further special advantage of operational methods is that they help to 
resolve mathematical problems for which existing techniques of distribution 
theory are quite unsuited. One of the more prevalent of these in multivari- 
ate models, at least in the present stage of the development of the subject, 
arises from the presence of random matrices (usually sample covariance ma- 
trices) that are embedded in tensor formations. These tensor formations in- 
hibit the use of conventional techniques such as change of variable methods. 
Prominent examples of such problems occur in econometrics with seemingly 
unrelated regression equations, and systems estimation methods like three 
stage least squares. In multivariate analysis many multivariate tests, such
as the Wald test for testing coefficient restrictions in the multivariate linear
model, come into this category. Since this particular test includes so many
commonly occurring statistics such as the F test, Hotelling’s $T^2$ and the
$T_0^2$ statistic, we shall use it as the focus of our attention in this paper as
a prototypical application of the operational method. For other examples
and related work the reader may refer to some other papers by the author
(1984a, 1984b, 1985, 1986).

2. FRACTIONAL OPERATORS 

Historically, the concept of a fractional operator arose from the attempt
by classical mathematicians, principally Leibnitz, Euler, Liouville and
Riemann, to extend the meaning of the operation of differentiation (to an
integral order) to encompass differentiation of an arbitrary order. These
classical mathematicians addressed the following question: given the operator
$D = d/dx$ and rules for working with $D^n$ to the integer order $n$, what, if
any, meaning may be ascribed to $D^{\alpha}$ where $\alpha$ is fractional or
possibly even complex? An interesting historical study of the evolution of ideas in this
field is provided by Ross (1974b), who traces the origin of this search for 
an extended meaning of the differential operator to correspondence between 
Leibnitz and L’Hospital in 1695. 

Using the integral representation of the gamma function a very simple
intuitive approach to fractional (complex) operators may be developed. Thus,
if Re($\alpha$) > 0, Re($z$) > 0 we have:

    $z^{-\alpha} = \Gamma(\alpha)^{-1} \int_0^{\infty} e^{-zt} t^{\alpha-1} dt$.    (1)

This formula, which is extensively used in applied mathematics, provides 
a simple mechanism for replacing an awkward power of a complex variable 
that occurs in a denominator by an integral involving an exponent which is 
much simpler to deal with. In a certain sense, this simple idea is the key to 
much of the subject and to its multivariate extensions that we shall examine 
below. 

If we now consider replacing $z$ in (1) by the operator $D = d/dx$ we note
that whereas $D^{-\alpha}$ is difficult to interpret, $e^{-tD}$ is not. The
operator $e^{-tD}$ yields Taylor series representations for analytic functions
and may be regarded as a simple shift operator. Thus

    $e^{-tD} f(x) = f(x - t)$    (2)

for $f$ analytic. This suggests that we may formally write:

    $D^{-\alpha} f(x) = \Gamma(\alpha)^{-1} \int_0^{\infty} f(x - t) t^{\alpha-1} dt$.    (3)



Then if the right side of (3) is absolutely convergent it may be used as a
definition for the fractional integral $D^{-\alpha} f(x)$. Quite general
operators with complex powers such as $D^{\mu}$ may now be defined by writing

    $D^{\mu} f(x) = D^{m} \{ D^{-\alpha} f(x) \}$,

where $\mu = m - \alpha$, $m$ is a positive integer and Re($\alpha$) > 0. Operators of this
type obey the law of indices and are commutative, although this is not true 
of general matrix extensions, of course. At an abstract level, these operators 
may be used to form algebraic systems such as operator algebras, which may 
in turn be used to justify routine manipulations of the operators as algebraic 
symbols. 

After a change of variable, the right side of (3) may be written as:

    $D^{-\alpha} f(x) = \Gamma(\alpha)^{-1} \int_{-\infty}^{x} f(s)(x - s)^{\alpha-1} ds$,    (4)

which corresponds to one form of the Weyl fractional integral (see, for ex- 
ample, Miller, 1974). 

It is easy to show with this definition that:

    $D^{-\alpha} e^{ax} = a^{-\alpha} e^{ax}$.    (5)

This may be proved using (3) for Re($a$) > 0, Re($\alpha$) > 0. The result (5)
then holds by analytic continuation for all complex $a \ne 0$ and for all
complex $\alpha$. Similar results extending the rules for differentiating
elementary functions may be obtained in the same way. Another rule which is
quite useful is:

    $D^{\mu} (1 - x)^{-\beta} = \frac{\Gamma(\beta + \mu)}{\Gamma(\beta)} (1 - x)^{-\beta-\mu}$,
    Re($\beta$) > 0, Re($\beta + \mu$) > 0.    (6)

(5) and (6) illustrate the great advantage that the Weyl operator (3) has
over the Riemann-Liouville operator defined by

    ${}_{x_0}D_x^{-\alpha} f(x) = \Gamma(\alpha)^{-1} \int_{x_0}^{x} f(s)(x - s)^{\alpha-1} ds$    (7)

for Re($\alpha$) > 0. The finite limit of integration $x_0$ in (7) allows us to
admit a wider class of functions into the definition (avoiding the conditions of
convergence required by the improper integral involved in the Weyl definition
(4)). However, when (7) is applied to elementary functions the results are
usually much more complex than (5) and (6). For example, in the case of
(5) we have

    ${}_{x_0}D_x^{-\alpha} e^{ax} = a^{-\alpha} e^{ax} \Gamma(\alpha)^{-1} \Gamma(\alpha, a(x - x_0))$,

where $\Gamma(\alpha, z)$ is the incomplete gamma function. This complication
turns out to be a significant drawback to the Riemann-Liouville operator in
multivariate extensions and in applications to distribution theory. I have,
therefore, found it most useful in my own work to employ (3) and its various
generalizations rather than (7).
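
The two scalar rules can be checked numerically from the Weyl definition (3). The sketch below is only an illustration: it evaluates the integral by the trapezoid rule after the substitution $t = u^2$, uses real arguments, and compares the result with $a^{-\alpha} e^{ax}$ and with $\Gamma(\beta - \alpha)\Gamma(\beta)^{-1}(1 - x)^{-(\beta - \alpha)}$, i.e. rule (6) with $\mu = -\alpha$. The function name, the truncation point and the parameter values are assumptions.

```python
import math


def weyl_integral(f, x, alpha, upper=40.0, steps=40000):
    """Approximate Gamma(alpha)^-1 * int_0^inf f(x - t) t^(alpha - 1) dt.

    Uses t = u^2, so the integral becomes 2 * int_0^inf f(x - u^2) u^(2*alpha - 1) du,
    truncated at u = upper and evaluated by the trapezoid rule.
    """
    h = upper / steps
    total = 0.0
    for i in range(steps + 1):
        u = i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * f(x - u * u) * u ** (2.0 * alpha - 1.0)
    return 2.0 * h * total / math.gamma(alpha)


a, alpha, beta, x = 1.5, 1.5, 6.0, 0.3

# Rule (5): fractional integral of exp(a x).
numeric_exp = weyl_integral(lambda s: math.exp(a * s), x, alpha)
exact_exp = a ** (-alpha) * math.exp(a * x)

# Rule (6) with mu = -alpha: fractional integral of (1 - x)^(-beta).
numeric_pow = weyl_integral(lambda s: (1.0 - s) ** (-beta), x, alpha)
exact_pow = math.gamma(beta - alpha) / math.gamma(beta) * (1.0 - x) ** (-(beta - alpha))
```

Both comparisons agree to several digits for these values, and neither closed form involves an incomplete gamma function.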

Multivariate extensions follow from the matrix gamma integral:

    $(\det Z)^{-\alpha} = \Gamma_n(\alpha)^{-1} \int_{S>0} \mathrm{etr}(-SZ)(\det S)^{\alpha-(n+1)/2} dS$,    (8)

where $Z$ is an $n \times n$ matrix with Re($Z$) > 0 and Re($\alpha$) > $(n - 1)/2$,
and etr($\cdot$) represents $e$ to the power of the trace of the matrix.
$\Gamma_n(\alpha)$ is the multivariate gamma function which may be evaluated as
$\Gamma_n(\alpha) = \pi^{n(n-1)/4} \prod_{i=1}^{n} \Gamma(\alpha - (i - 1)/2)$.
The integral (8) is extensively used in multivariate analysis. Its
significance was first brought into prominence in the remarkable paper by 
Herz (1955). 
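
The product formula for the multivariate gamma function is straightforward to evaluate for real arguments. The helper below is a minimal sketch with a hypothetical name; it simply multiplies out the product and rejects arguments outside the region Re($\alpha$) > $(n - 1)/2$.

```python
import math


def multivariate_gamma(n, alpha):
    """Gamma_n(alpha) = pi^(n(n-1)/4) * prod_{i=1..n} Gamma(alpha - (i-1)/2), real alpha."""
    if alpha <= (n - 1) / 2:
        raise ValueError("alpha must exceed (n - 1)/2")
    value = math.pi ** (n * (n - 1) / 4)
    for i in range(1, n + 1):
        value *= math.gamma(alpha - (i - 1) / 2)
    return value


gamma_1 = multivariate_gamma(1, 2.5)   # reduces to the ordinary gamma function
gamma_2 = multivariate_gamma(2, 1.5)   # sqrt(pi) * Gamma(1.5) * Gamma(1)
```

For $n = 1$ the function reduces to $\Gamma(\alpha)$, matching the scalar formula (1).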

We may now proceed as in the scalar case by introducing the matrix
operator $\partial Z = \partial/\partial Z$. Whereas $(\det \partial Z)^{-\alpha}$
is difficult to interpret, $\mathrm{etr}(-\partial Z\, S)$ is not. In fact, if
$f(Z)$ is an analytic function of the matrix variate $Z$ the operator
$\mathrm{etr}(-\partial Z\, S)$ yields the matrix Taylor series representation

    $\mathrm{etr}(-\partial Z\, S) f(Z) = f(Z - S)$,    (9)

generalizing (2). We may therefore define

    $D_Z^{-\alpha} f(Z) = \Gamma_n(\alpha)^{-1} \int_{S>0} f(Z - S)(\det S)^{\alpha-(n+1)/2} dS$;
    $D_Z = \det \partial Z$    (10)

provided the integral is absolutely convergent and Re($\alpha$) > $(n - 1)/2$. The
general case of an arbitrary complex power of $D_Z$ may be dealt with in the
same way as the scalar case by setting

    $D_Z^{\mu} f(Z) = D_Z^{m} \{ D_Z^{-\alpha} f(Z) \}$

for $\mu = m - \alpha$ with $m$ a positive integer and Re($\alpha$) > $(n - 1)/2$.
Elementary functions of matrix argument may be complex differentiated
as before. Thus

    $D_Z^{-\alpha} \mathrm{etr}(AZ) = \mathrm{etr}(AZ)(\det A)^{-\alpha}$    (11)

generalizes (5) and may be proved for Re($A$) > 0, Re($\alpha$) > $(n - 1)/2$ using
(10). The formula (11) holds by analytic continuation for all nonsingular $A$
and for all complex $\alpha$. In a similar way, we find

    $D_Z^{\mu} \det(I - Z)^{-\beta} = \frac{\Gamma_n(\beta + \mu)}{\Gamma_n(\beta)} \det(I - Z)^{-\beta-\mu}$,
    Re($\beta$) > $(n - 1)/2$, Re($\beta + \mu$) > $(n - 1)/2$,    (12)

generalizing (6).

It is also useful to work with more complicated operators than $D_Z$. For
example, if $R$ is a $q \times nm$ matrix of rank $q \le nm$ and $M$ is a positive
definite $m \times m$ matrix, then we may define

    $[\det\{R(\partial Z \otimes M)R'\}]^{-\alpha} f(Z)$
        $= \Gamma_q(\alpha)^{-1} \int_{S>0} [\mathrm{etr}\{-R(\partial Z \otimes M)R'S\} f(Z)] (\det S)^{\alpha-(q+1)/2} dS$    (13)

if the integral converges absolutely. The exponent $R(\partial Z \otimes M)R'S$
in the integrand of (13) is linear in the operator $\partial Z$ and we may write:

    $\mathrm{tr}[-R(\partial Z \otimes M)R'S] = \mathrm{tr}[-\partial Z\, Q(S)]$,

where the $n \times n$ matrix $Q$ is linear in the elements of $S$. Thus, (13) has
the form:

    $\Gamma_q(\alpha)^{-1} \int_{S>0} f(Z - Q(S))(\det S)^{\alpha-(q+1)/2} dS$.

Extensions to more complex tensor formations of operators are possible in
an analogous fashion. Some of these are given and applied in one of the
author’s papers (1985) on the subject. When $f(Z)$ is an elementary function
like $\mathrm{etr}(ZA)$ one obtains extensions of rules such as (11):

    $[\det\{R(\partial Z \otimes M)R'\}]^{-\alpha} \mathrm{etr}(AZ) = \mathrm{etr}(AZ) [\det(R(A \otimes M)R')]^{-\alpha}$.    (14)

Once again (14) is proved for Re($A$) > 0 and Re($\alpha$) > $(q - 1)/2$ and then
analytically continued for all nonsingular $A$ and all complex $\alpha$.
3. MULTIVARIATE TESTS

To illustrate the use of these operator methods in distribution theory we 
shall consider some commonly occurring multivariate tests. What we present 
here will in large part be a review of work already done by the author in 
(1984, 1986) and the reader is referred to these papers for full details and 
generalizations. However, we shall present some new results on asymptotic 
expansions and exact distribution functions. 

We shall be concerned with the multivariate linear model 

    $y_t = A x_t + u_t$;    (15)

$y_t$ is a vector of $n$ dependent variables, $A$ is an $n \times p$ matrix of
parameters, $x_t$ is a vector of nonrandom independent variables and the $u_t$
are i.i.d. $N(0, \Omega)$ errors with $\Omega$ positive definite. Let us suppose
that we are interested in a general linear hypothesis involving the elements of
$A$, which we write in null and alternative form as:

    $H_0: R\,\mathrm{vec}\,A = r$,    $H_1: R\,\mathrm{vec}\,A - r = b \ne 0$,    (16)

where $R$ is a $q \times np$ matrix of known constants of rank $q$, $r$ is a known
vector and vec($A$) stacks the rows of $A$.

From least squares estimation of (15) we have: 

    $A^* = Y'X(X'X)^{-1}$,    $\Omega^* = Y'(I - P_X)Y/N$,    (17)

where $Y' = [y_1, \ldots, y_T]$, $X' = [x_1, \ldots, x_T]$, $P_X = X(X'X)^{-1}X'$
and $N = T - p$. We take $X$ to be a matrix of full rank $p < T$ and define
$M = (X'X)^{-1}$. The Wald statistic for testing the hypothesis (16) is

    $W = (R\,\mathrm{vec}\,A^* - r)' \{R(\Omega^* \otimes M)R'\}^{-1} (R\,\mathrm{vec}\,A^* - r)$
      $= N l'Bl$,    (18)

where $l = R\,\mathrm{vec}\,A^* - r$ is $N(b, V)$ under $H_1$ with
$V = R(\Omega \otimes M)R'$, and $B = \{R(C \otimes M)R'\}^{-1}$;
$C = Y'(I - P_X)Y$ is central Wishart with covariance matrix $\Omega$ and $N$
degrees of freedom.
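
As a concrete instance of (17) and (18), the sketch below works through the simplest case of one dependent variable and one regressor ($n = 1$, $p = 1$, $R = 1$), where every matrix reduces to a scalar. The function name and the data are hypothetical; the code computes $W$ both from the first line of (18) and from the form $N l'Bl$ so the two can be compared.

```python
def wald_scalar(x, y, r):
    """Wald statistic for H0: a = r in y_t = a * x_t + u_t (scalar case of (18))."""
    T = len(x)
    p = 1
    sxx = sum(v * v for v in x)
    a_star = sum(xi * yi for xi, yi in zip(x, y)) / sxx  # least squares estimate
    C = sum((yi - a_star * xi) ** 2 for xi, yi in zip(x, y))  # residual sum of squares
    N = T - p
    omega_star = C / N      # error variance estimate, Omega* in (17)
    M = 1.0 / sxx           # (X'X)^{-1}
    l = a_star - r
    W_direct = l * l / (omega_star * M)   # first line of (18)
    W_quadratic = N * l * (1.0 / (C * M)) * l   # N l'Bl with B = {R(C x M)R'}^{-1}
    return a_star, W_direct, W_quadratic


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.1, 5.9, 8.2]
a_star, W_direct, W_quadratic = wald_scalar(x, y, 0.0)
_, W_at_estimate, _ = wald_scalar(x, y, a_star)
```

In this scalar case $W$ is the square of the usual t ratio, and it vanishes when the hypothesized value equals the estimate.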

We define $y = l'Bl$ and write $y$ in canonical form as

    $y = g'Gg$,    (19)

where $g = V^{-1/2} l$ is $N(m, I_q)$, $m = V^{-1/2} b$ and
$G^{-1} = V^{-1/2} \{R(C \otimes M)R'\} V^{-1/2}$.
With this notation we see that $y$ and $W$ are simply positive
definite quadratic forms in normal variates, conditional on C. The distri- 
bution problem becomes one of integrating up this conditional distribution 
over the distribution of C. 

Important special cases of the statistic W are as follows. 
(i) The regression F statistic

Set $n = 1$, $A = a'$, $H_0: Ra = r$, $\Omega^* = s^2$ and then

    $W = (Ra^* - r)' [RMR']^{-1} (Ra^* - r)/s^2$
      $\equiv c\,F_{q,N}$,    (20)

where $F_{q,N}$ denotes a variate with an $F$ distribution with $q$ and $N$ degrees
of freedom. In (20) we use the symbol “$\equiv$” to signify equality in distribution
and the letter “c” to represent a constant. These notations will be used 
throughout the paper. 

(ii) Hotelling’s $T^2$ statistic

Set $R = R_1 \otimes r_2'$, $H_0: R_1 A r_2 = r$ and then

    $W = (R_1A^*r_2 - r)' [R_1\Omega^*R_1']^{-1} (R_1A^*r_2 - r)/r_2'Mr_2$
      $\equiv c\,x'S^{-1}x \equiv c\,F_{q,N-q+1}$,    (21)

with $x \equiv N(0, R_1\Omega R_1')$ and $S \equiv W_q(N, R_1\Omega R_1')$ under the
null; $x$ and $S$ are of course independent.

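(iii) The $T_0^2$ statistic

Set $R = R_1 \otimes R_2'$, $H_0: R_1AR_2 = r$ with $R_1$ $q_1 \times n$ and $R_2$
$m \times q_2$. Then

    $W = \mathrm{vec}(R_1A^*R_2 - r)' [R_1\Omega^*R_1' \otimes R_2'MR_2]^{-1} \mathrm{vec}(R_1A^*R_2 - r)$
      $= \mathrm{tr}[(R_1A^*R_2 - r)'(R_1\Omega^*R_1')^{-1}(R_1A^*R_2 - r)(R_2'MR_2)^{-1}]$
      $\equiv c\,\mathrm{tr}(X'S_2^{-1}X)$
      $\equiv c\,\mathrm{tr}(S_1S_2^{-1})$,    (22)

with $X \equiv$ matrix $N(0, R_1\Omega R_1' \otimes I_{q_2})$,
$S_2 \equiv W_{q_1}(N, R_1\Omega R_1')$ and $S_1 \equiv W_{q_1}(q_2, R_1\Omega R_1')$
under the null. Because of invariance to the covariance matrix in (22) we may
treat $S_1$ as $W_{q_1}(q_2, I_{q_1})$ and $S_2$ as $W_{q_1}(N, I_{q_1})$; $S_1$
and $S_2$ are independent.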

Interestingly, the exact distribution of the statistic $\mathrm{tr}(S_1S_2^{-1})$ has not
been found in the statistical literature, in spite of apparently substantial 
efforts by many researchers (see Pillai (1976, 1977) and Muirhead (1982) for 
reviews). Many conjectures have been made about the form of the exact 
density of this statistic. The classic article by Constantine (1966) which 
gives a series representation that is valid over the unit interval [0, 1] is still 
perhaps the most general treatment. We shall show below how the distribu- 
tion may be found in the general case quite simply by operator algebra. A 
full treatment is available in the author’s paper (1986). 
4. THE NULL DISTRIBUTION



It is shown by Phillips (1986, equation (32)) that the null density of W 
in the general case (18) is given by:






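    pdf($w$) $= \frac{w^{q/2-1}}{N^{q/2}\Gamma(q/2)} [\det(L(\partial X \otimes I)L')]^{1/2}\, {}_0F_0(-L(\partial X \otimes I)L', w/N) \det(I - X)^{-N/2} |_{X=0}$
                                                                    (23)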

where 



L = [iJ(n ® R 



(24) 



The function ${}_0F_0(-L(\partial X \otimes I)L', w/N)$ in (23) is a linear
operator which may be explicitly represented as:

    $\int_{V_{1,q}} \mathrm{etr}\{-(w/N)L(\partial X \otimes I)L'hh'\} (dh)$,

where $(dh)$ denotes the normalized invariant measure on the sphere
$V_{1,q} = \{h : h'h = 1\}$. An alternative representation in terms of an
absolutely convergent operator power series is also available:

    $\sum_{j=0}^{\infty} \frac{(-1)^j (w/N)^j C_j(L(\partial X \otimes I)L')}{j!\,C_j(I)}$,

where $C_j(\cdot)$ denotes the top order zonal polynomial of degree $j$, for which
explicit formulae were given by James (1964).

The simplicity of (23) is unusually striking. Yet, as we shall see, all 
existing exact distribution theory for the null case is embodied in this for- 
mula. Moreover, (23) also delivers the appropriate asymptotic theory and 
asymptotic expansions with little effort. In the following specializations we 
shall use the notational reductions detailed for these special cases in Section 
3. 

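(i) The regression F statistic

    pdf($w$) $= \frac{w^{q/2-1}}{N^{q/2}\Gamma(q/2)} (\partial x)^{q/2} e^{-w\partial x/N} (1 - x)^{-N/2} |_{x=0}$
            $= \frac{w^{q/2-1}\,\Gamma((N+q)/2)}{N^{q/2}\Gamma(q/2)\Gamma(N/2)} e^{-w\partial x/N} (1 - x)^{-(N+q)/2} |_{x=0}$
            $= \frac{\Gamma((N+q)/2)}{N^{q/2}\Gamma(q/2)\Gamma(N/2)} w^{q/2-1} (1 + w/N)^{-(N+q)/2}$
            $\equiv c\,F_{q,N}$.

The reductions in the second and third lines above follow directly from the
rules (5) and (6) given earlier for fractional differentiation.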
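
The closed form that results from these reductions can be checked numerically. Under the reading adopted here (an assumption, since the displayed lines are partly reconstructed), rule (6) with $\mu = q/2$ and $\beta = N/2$ followed by the shift $x \to x - w/N$ gives pdf($w$) proportional to $w^{q/2-1}(1 + w/N)^{-(N+q)/2}$, which is the density of $q F_{q,N}$. The sketch below uses illustrative values $q = 4$, $N = 10$ and hypothetical function names, and checks that this density integrates to one and has mean $qN/(N - 2)$.

```python
import math


def pdf_w(w, q, N):
    """Density Gamma((N+q)/2) / (Gamma(q/2) Gamma(N/2) N^(q/2)) * w^(q/2-1) * (1+w/N)^(-(N+q)/2)."""
    const = math.gamma((N + q) / 2) / (
        math.gamma(q / 2) * math.gamma(N / 2) * N ** (q / 2)
    )
    return const * w ** (q / 2 - 1) * (1 + w / N) ** (-(N + q) / 2)


def trapezoid(g, lo, hi, steps):
    """Composite trapezoid rule for g on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (g(lo) + g(hi))
    for i in range(1, steps):
        total += g(lo + i * h)
    return total * h


q, N = 4, 10
total_mass = trapezoid(lambda w: pdf_w(w, q, N), 0.0, 400.0, 40000)
mean_w = trapezoid(lambda w: w * pdf_w(w, q, N), 0.0, 400.0, 40000)
```

For these values the mean of $q F_{q,N}$ is $4 \times 10/8 = 5$, which the numerical integral reproduces up to truncation error.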

(ii) Hotelling’s $T^2$ statistic

Noting that L = Li (S) ^2 with we find that the density of 

W is: 



pdf(w) = \det{LidXL\yf^ oFo{-LidXL\,w/N) 

= [(det dXiiy^^aoFo{-dXn,w/N) 

= [ oFo{-dXn,»)/N)det{I - 



= etr - {w/N)dXiihh'{dh) 

■ det{I - 

Jxi,=o 

= f det(7 + {w/N)hh')~^^+^^^^{dh) 

Jyi., 



Xii=0 



= cFq^N-q^l. 



In the second line of this argument Xu is a gi X q 2 matrix of auxiliary 
variates obtained from the q X q matrix X by transforming X — > PXP^ 
where P' = [L[yK*] is orthogonal. Note that under this transformation 
dX — > P^dXP and L\dXLi — ► dXn, giving the stated result. 

(iii) The $T_0^2$ statistic

L = Li 0 L 2 , LiLi = Iq^^ q = qiq 2 and the density of W is: 



pdf(tu) = [det(LiaXLi)«=/* oFo{-LidXL'i ® Ig^,w/N) 

= [ oFoi-dXii ® Ig,,w/N) 

JXii=0 

— f etT{-{w/N) 
• {dXii (g) I)hh'} {dh) det(7 - (25)

J Xii=0 

= f det (/ + {w {dh). (26) 

Jvi,. 

where Q = = (^i> • • • > ^ < 1 we may 

expand the determinantal expression in the integrand of (26) giving 



pdf(w) = 1 ^ L - H’Jp — t ^ ) C^{Q){dh) 

^ \ / K 









(-w/jy)* 






which is the series obtained by Constantine (1966) for the null distribution. 
The integration over $V_{1,q}$ leading to (27) may be obtained quite simply using
operator methods. The reader is referred to Phillips (1984b) where full 
details are given. 

An alternative everywhere convergent series is obtained by working from 
(25). Once again details are provided by Phillips (1984b). We state only the 
final result here: 



pdf(it;) = 



c 

{N H- 






(28) 



where the summations are over all partitions 0^ k oik into < q\ parts, and 
the are certain constants. 

(iv) Asymptotic theory 

We employ the simple asymptotic representation

    $\det(I - X)^{-N/2} \sim \mathrm{etr}(NX/2)$

for $X \to 0$ in (23) and deduce immediately that:

    pdf($w$) $\sim \frac{w^{q/2-1} e^{-w/2}}{2^{q/2}\Gamma(q/2)} \equiv \chi_q^2$.    (29)

Thus, the asymptotic distribution appears as a special case of (23) in a single 
step. 
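
In the regression F case the limit can be observed directly. The sketch below is an illustration under the assumption that the exact null density is that of $q F_{q,N}$; it evaluates that density on the log scale (to avoid overflow for large $N$) and compares it with the $\chi_q^2$ density at a single point. Function names, the evaluation point $w = 3$ and the values of $N$ are hypothetical.

```python
import math


def log_pdf_w(w, q, N):
    """Log density of q * F_{q,N}, written with lgamma so that large N is safe."""
    return (
        math.lgamma((N + q) / 2)
        - math.lgamma(q / 2)
        - math.lgamma(N / 2)
        - (q / 2) * math.log(N)
        + (q / 2 - 1) * math.log(w)
        - ((N + q) / 2) * math.log1p(w / N)
    )


def chi2_pdf(w, q):
    """Chi-square density with q degrees of freedom."""
    return w ** (q / 2 - 1) * math.exp(-w / 2) / (2 ** (q / 2) * math.gamma(q / 2))


q, w = 4, 3.0
gaps = [abs(math.exp(log_pdf_w(w, q, N)) - chi2_pdf(w, q)) for N in (10, 100, 10000)]
```

The gap shrinks roughly in proportion to $1/N$, in line with the correction terms that the higher order expansion supplies.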
(v) Higher order asymptotics

We transform $X \to Z = NX$ in (23), giving $\partial X = N\partial Z$ and:

pdf(«;) = [det {L{dZ ® 

• o^’o {-L(dZ ® I)L\ w) det( J - 

J ^=0 

We now expand the determinantal factor as $N \uparrow \infty$:
det(/ - ZjN)~^f^ = exp |-^lndet | 

—pjfE^d) } 



f 1 

etr 1 -Z I exp < - > , . 

V2 J l2^(j + l)iVJ 



(1/2)^ 



tr Z^^+Hr Z^'^+^ 

'n n (ii + 1) (j 2 + 1) • • • (ii + 1) 



• (31) 



In the final expression (31) the summation is over all $\ell$-tuples of positive
integers $(j_1, \ldots, j_\ell)$ satisfying

t 

= 1,2,...,*; (i<i<£). 

»=i 

We deduce from (30) and (31) the following general form for the asymptotic
expansion of the density of $W$ to an arbitrary order as $N \uparrow \infty$:



pdf(ii;) 



u;9/2-ie-W2 

2«/2r(g/2) 
r' 1 ^(1/2)"' 



h. r(g/2)(ia + l)...(j + l) 

• [det (L(dZ ® I)L'f^ qFq {~L{dZ ® I)L', w) 

• etr (^z\ tr tr . . .tr 



( 32 ) 
To $O(N^{-1})$ we have:

pdf(«;) = X* + 2^^ H 

• oFo ^ I)L' ^w)eii {^\-z\tx 

\2 / Jj^=o 

+ o(iV-^). (33) 

The correction term of $O(N^{-1})$ in (33) may be evaluated using the rules of
operator calculus given earlier. The final result may be shown to correspond 
to the expression obtained by more conventional methods by Phillips (1984c). 



5. THE DISTRIBUTION FUNCTION 

We may also derive the cdf of the null distribution of $W$. We shall use
the incomplete gamma integral:

    $\int_0^{Y} e^{-ay} y^{\alpha-1} dy = \alpha^{-1} Y^{\alpha}\, {}_1F_1(\alpha, \alpha + 1; -aY)$,

where Re($\alpha$), Re($Y$) > 0 (Erdélyi, 1953, p. 266). We have:
cdf (u;) = P{W < w) 

= ^det{L{dX ® 

• oFo {-L{dX ® I)L', y/Z) det( J - dy 

= [i\r«/*r(9/2)] [det {L{dX ® 

• exp {[ylN)h'L(,dX ® I)L'h) {dh) det(7 - dy. 

Interchanging the orders of operation in the above expression, which is per- 
missible in view of the continuity of the integrand and the compactness of 
the domains of integration, we obtain: 

• f iFi{q/2,q/2+V,-{w/N)n'L{dX®I)L'h){dh) 

Jvi., 







• det(/- 

• {q/2,q/2+l-,-{w/N),L{dX®I)L') 



In (34), ${}_1F_1$ is a confluent hypergeometric function with two matrix arguments
(see James, 1964). In the present case one of the arguments is scalar
and the function admits a series representation in terms of top order zonal
polynomials.
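The scalar incomplete gamma integral quoted at the start of this section can be verified numerically. The sketch below assumes the identity in the form $\int_0^Y e^{-\xi y}y^{a-1}dy = a^{-1}Y^a\,{}_1F_1(a, a+1; -Y\xi)$; the helper names and the parameter values are illustrative choices, not part of the original derivation.

```python
import math

def hyp1f1(a, b, x, terms=200):
    """Kummer series 1F1(a; b; x) = sum_k (a)_k / (b)_k * x**k / k!."""
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        term *= (a + k) / (b + k) * x / (k + 1)
    return total

def closed_form(a, xi, Y):
    """Right-hand side: a**(-1) * Y**a * 1F1(a, a+1; -Y*xi)."""
    return Y ** a / a * hyp1f1(a, a + 1.0, -Y * xi)

def midpoint_integral(a, xi, Y, steps=20000):
    """Left-hand side by the midpoint rule: int_0^Y exp(-xi*y) * y**(a-1) dy."""
    h = Y / steps
    return h * sum(
        math.exp(-xi * (i + 0.5) * h) * ((i + 0.5) * h) ** (a - 1.0)
        for i in range(steps)
    )

if __name__ == "__main__":
    a, xi, Y = 1.5, 0.5, 2.0  # a = q/2 with q = 3, chosen only for illustration
    print(closed_form(a, xi, Y), midpoint_integral(a, xi, Y))
```

For $a = 1$ the right-hand side reduces to $(1 - e^{-\xi Y})/\xi$, which offers an exact check in addition to the numerical quadrature.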



6. THE NON-NULL DISTRIBUTION 

Analysis of the non-null distribution of W proceeds along similar lines.
The derivations are more complicated and the reader is referred to the
author’s paper (1986) for details. The final result for the density may be
expressed as:



^q/2-lg-m'm/2 . 

■"“W = 

■f exp{-{w/N)h'L{dX®I)L'h} 

Jyi., 

■ oFi L{dX ® I)L'mm'h^ {dh) 

(35) 

An alternative series representation of (35) is possible in terms of top order 
invariant polynomials (Davis, 1979) with two matrix argument operators. 
Specializations to the non-null distributions of the statistics in Section 4 
and to the asymptotic theory of W under local alternatives are also given 
by Phillips (1986). 



7. CONCLUSIONS 



There seems to be considerable scope for applying the methods outlined 
here to other problems of distribution theory in multivariate analysis. The 
author (1984a) has used similar methods in studying the distribution of the
Stein-rule estimator in linear regression. The latter results have recently 
been extended by Knight (1986) to nonnormal errors. 

The technique of developing general formulae for asymptotic expansions 
from exact theory also seems to be very promising. This approach avoids 
much of the tiresome algebraic manipulation that is a feature of the tra- 
ditional work on Edgeworth expansions. Moreover, the final formulae are 
simpler in form and may be used to obtain expansions to an arbitrary order, 
which is very difficult with the traditional approach. 

Here and elsewhere in the application of these methods to problems 
of distribution theory it would be helpful to have a glossary of results on 
fractional and matrix fractional calculus. Until now I have been developing 
rules for working with these operators as the need for them arose. With a 
systematic set of formulae for the action of matrix fractional operators on 
elementary and commonly occurring special functions as well as rules for 
operation on products and compositions of functions of matrix argument it 
should be possible to make progress on many presently unsolved problems 
of multivariate distribution theory. 
ON ROBUSTNESS OF TESTS OF LINEAR
RESTRICTIONS IN REGRESSION MODELS 
WITH ELLIPTICAL ERROR DISTRIBUTIONS 

1. INTRODUCTION 

Testing a set of linear restrictions in a regression model is usually
performed with the help of the F-statistic, or the statistic based on the likelihood
ratio (LR). More recently two other procedures, the Lagrangian Multiplier
or Rao Score (RS) test due to Rao (1947) and Silvey (1959), and the Wald
(W) test (1943), have become popular with econometricians; see, for example,
Breusch and Pagan (1980) and Evans and Savin (1982).

A statistic can be called numerically robust over a class of error distribu- 
tions if its values are independent of the specific error distribution from that 
class. If the statistic is such that no matter which error distribution from 
the class of distributions is considered the test criterion remains unchanged 
then the statistic is inferentially robust over that class. 

If the statistics F, LR, RS and W are constructed based on the
assumption of the spherical normal error distribution (normal error with the
covariance matrix $I$), then F and LR are numerically robust against the
class of all monotonically decreasing continuous spherical distributions, but
RS and W are not. However, all these statistics are inferentially robust
over this class; thus the test conclusions reached under the assumption of
normality will not be overturned if the error distribution is spherical. These
results are derived by Ullah and Zinde-Walsh (1984, 1985).

In this paper we consider the issues of numerical and inferential robustness
of F, LR, RS and W tests, based on the assumption of spherical normality,
against the general class of elliptical error distributions (errors with
the nonscalar covariance matrix $\Sigma$). We provide the necessary and sufficient
conditions for numerical robustness for the class of covariance matrices often
used in econometrics, for example, autoregressive (AR), moving average
(MA) and heteroskedastic. Our investigation shows that for these covariance
matrices the numerical robustness of test statistics under consideration
is rare. Our results are more general than those given by Ghosh and Sinha
(1980) and Sinha and Mukhopadhyay (1981), who considered only intraclass
covariance structure. Also, while Khatri (1981) gave conditions for numerical
robustness in terms of pairs of data and covariance matrices, robustness
over classes of covariance matrices considered here has not been examined
in his paper.

Our investigation also showed the limitations of exact inferential robustness.
We, therefore, looked into the robustness of tests by developing bounds
for critical values which will ensure that the conclusions based on the usual
tests are not affected against a particular class of distributions. Bounds for
critical values of test statistics for t and F-tests for first-order AR, MA and
ARMA processes have been tabulated (for normal errors) by Vinod (1976),
Vinod and Ullah (1981) and Kiviet (1980). Their calculations require knowledge
of all the eigenvalues of the matrices which characterize these processes
and are quite complex. The situation becomes more complicated for higher
order ARMA processes. Our method offers bounds which are cruder for
the specific processes considered by Vinod and Ullah and Kiviet, but they
have the advantage of computational simplicity and generality; that is, they
provide critical values that guarantee robustness of the test conclusions, for
any $\Sigma$ matrix, over wide classes of error distributions, and utilize only
the highest and lowest eigenvalues of the covariance matrix.

The plan of the paper is as follows. Section 2 develops the notation and 
definitions. Section 3 deals with the problem of numerical robustness and 
some applications. In Section 4 we examine the question of robustness of 
test conclusions and provide our bounds on the critical values of statistics. 
Finally, the proofs of the lemmas and theorems are presented in Section 5. 



2. DEFINITIONS AND NOTATION 

We consider the general linear regression model

$$y = X\beta + u, \qquad (2.1)$$

where $y$ is an $n \times 1$ vector of observations, $X$ is an $n \times p$ known matrix
of rank $p < n$, $\beta$ is a $p \times 1$ unknown parameter vector and $u$ is an $n \times 1$
disturbance vector whose probability density function is

$$\phi(u) = |\Sigma|^{-1/2}f(u'\Sigma^{-1}u), \qquad (2.2)$$
with a monotonically decreasing $f$, and positive definite $\Sigma$. If $\Sigma = I$, the
distribution given by (2.2) reduces to a spherically symmetric distribution.

Our problem is to test a set of $r$ linear restrictions $H_0 : R\beta = 0$ against
$H_1 : R\beta \neq 0$, where $R$ is an $r \times p$ known matrix of rank $r$. Under this
hypothesis, $\beta$ can be partitioned as

$$\beta = \begin{bmatrix} \beta_0 \\ L\beta_0 \end{bmatrix},$$

where rank $L = r$; if $X = [X_1 : X_2]$ with $X_1$ containing $p - r$ columns of $X$
and $X_2$ the remaining $r$, then for $X_0 = X_1 + X_2L$ the model under $H_0$ can
be rewritten as

$$y = X_0\beta_0 + u. \qquad (2.3)$$

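We denote by F, LR, RS, and W the values of the statistics calculated
according to the usual formulae under the assumption of multivariate
normality of $u$. F can be written as

$$F = \left[\frac{(y - X_0\hat\beta_0)'(y - X_0\hat\beta_0)}{(y - X\hat\beta)'(y - X\hat\beta)} - 1\right]\frac{q}{r}, \qquad q = n - p, \qquad (2.4)$$

where $\hat\beta_0$ and $\hat\beta$ are the respective least squares estimators of $\beta_0$ and $\beta$, and
LR, RS and W can be expressed through F, respectively, as

$$LR = n\log\left(1 + \frac{r}{q}F\right), \qquad RS = n\,\frac{r}{q}F \Big/ \left(1 + \frac{r}{q}F\right), \qquad W = n\,\frac{r}{q}F. \qquad (2.5)$$

We introduce the following projection matrices:

$$P = X(X'X)^{-1}X'; \quad A = I - P; \quad P_0 = X_0(X_0'X_0)^{-1}X_0'; \quad A_0 = I - P_0, \qquad (2.6)$$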



where I is the identity matrix. The following properties can be easily verified: 



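$$\operatorname{rank}P > \operatorname{rank}P_0; \quad \operatorname{rank}A < \operatorname{rank}A_0; \quad PP_0 = P_0; \quad AA_0 = A; \quad PX = X; \quad PX_0 = X_0.$$

Using (2.6) and noting that

$$\frac{(y - X_0\hat\beta_0)'(y - X_0\hat\beta_0)}{(y - X\hat\beta)'(y - X\hat\beta)} = \frac{u'A_0u}{u'Au}, \qquad (2.7)$$

we can rewrite (2.4) and (2.5) as

$$F = \frac{u'(A_0 - A)u}{u'Au}\,\frac{q}{r}, \qquad LR = n\log\frac{u'A_0u}{u'Au}, \qquad RS = n\,\frac{u'(A_0 - A)u}{u'A_0u}, \qquad W = n\,\frac{u'(A_0 - A)u}{u'Au}. \qquad (2.8)$$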
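Since all four statistics are functions of the ratio of restricted to unrestricted residual sums of squares, they can be computed together. The sketch below assumes the relations (2.4) and (2.5) in the form $LR = n\log(1 + rF/q)$, $RS = n(rF/q)/(1 + rF/q)$ and $W = n\,rF/q$; the function name and the numerical inputs are hypothetical.

```python
import math

def linear_restriction_stats(rss0, rss, n, p, r):
    """Return (F, LR, RS, W) from the restricted (rss0) and unrestricted (rss)
    residual sums of squares, following (2.4) and (2.5)."""
    q = n - p
    x = rss0 / rss - 1.0        # equals (r/q) * F, i.e. u'(A0 - A)u / u'Au
    f_stat = x * q / r
    lr = n * math.log(1.0 + x)
    rs = n * x / (1.0 + x)
    w = n * x
    return f_stat, lr, rs, w

if __name__ == "__main__":
    # Hypothetical residual sums of squares, for illustration only.
    print(linear_restriction_stats(120.0, 100.0, n=50, p=5, r=2))
```

Because $x \ge \log(1 + x) \ge x/(1 + x)$ for $x \ge 0$, the output always satisfies $W \ge LR \ge RS$.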



If, in fact, $u$ has the spherical normal distribution, all the statistics
have known distributions. If the error distribution is spherical, that is, the
likelihood function is given by (2.2) with $\Sigma = I$, we denote the values of the
statistics derived from this likelihood by $F_\phi\,(= F)$, $LR_\phi$, $RS_\phi$ and $W_\phi$.

We denote the statistics calculated under the assumption of elliptical
normality by $F_\Sigma$, $LR_\Sigma$, $RS_\Sigma$ and $W_\Sigma$.

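It is known that

$$F_\Sigma = \left[\frac{(y - X_0\hat\beta_{0\Sigma})'\Sigma^{-1}(y - X_0\hat\beta_{0\Sigma})}{(y - X\hat\beta_\Sigma)'\Sigma^{-1}(y - X\hat\beta_\Sigma)} - 1\right]\frac{q}{r} \qquad (2.9)$$

is the familiar F-statistic for testing $H_0$, with

$$\hat\beta_{0\Sigma} = (X_0'\Sigma^{-1}X_0)^{-1}X_0'\Sigma^{-1}y. \qquad (2.10)$$

Further, as in (2.5) we have

$$LR_\Sigma = n\log\left(1 + \frac{r}{q}F_\Sigma\right), \qquad RS_\Sigma = n\,\frac{r}{q}F_\Sigma \Big/ \left(1 + \frac{r}{q}F_\Sigma\right), \qquad W_\Sigma = n\,\frac{r}{q}F_\Sigma. \qquad (2.11)$$

For a general elliptical distribution in (2.2), denote the appropriate statistics
by $F_{\phi,\Sigma}$, $LR_{\phi,\Sigma}$, $RS_{\phi,\Sigma}$ and $W_{\phi,\Sigma}$.

Ullah and Zinde-Walsh (1984, 1985) have analyzed the numerical robustness
of LR, RS and W tests against spherically symmetric distributions. In
particular, they have shown that

$$LR = LR_\phi, \qquad RS = \psi_\phi RS_\phi, \qquad \text{and} \qquad W = \rho_\phi W_\phi, \qquad (2.12)$$

where $\psi_\phi$ and $\rho_\phi$ are constants which depend on the spherical distribution
$\phi(u)$. Thus LR is numerically robust against non-normality but RS and W
are not.

The elliptical distribution (2.2) can be transformed into a spherical distribution
by the substitution $u = \Sigma^{1/2}v$. Thus, for this case, from (2.12) we
easily obtain

$$LR_\Sigma = LR_{\phi,\Sigma}, \qquad RS_\Sigma = \psi_\phi RS_{\phi,\Sigma}, \qquad W_\Sigma = \rho_\phi W_{\phi,\Sigma}. \qquad (2.13)$$

Here $LR_\Sigma$ is numerically robust against non-normality, but $RS_\Sigma$ and
$W_\Sigma$ are not.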



3. MAIN RESULTS ON NUMERICAL ROBUSTNESS 

It was mentioned in Section 2 that Ullah and Zinde-Walsh (1984, 1985)
analyzed the numerical robustness of F, LR, RS and W against spherically
symmetric distributions. Robustness of $LR_\Sigma$, $RS_\Sigma$ and $W_\Sigma$ against
non-normality in elliptically symmetric distributions follows from that analysis.
Here we look into robustness of F, LR, RS and W (under spherical normality)
against elliptical normal distributions by comparing the values of these
statistics with the values $F_\Sigma$, $LR_\Sigma$, $RS_\Sigma$ and $W_\Sigma$ as in (2.9)-(2.11).

Conclusions about robustness against general elliptical distributions will
follow in view of the relationships given in (2.13). We also note that we
derive the results for parametric classes of $\Sigma$ matrices often used in the economic
literature.

For deriving the conditions under which $F_\Sigma$ (or $LR_\Sigma$) is numerically
robust over some class of $\Sigma$ given the data matrix $X$, we consider

$$\ell_\Sigma = 1 + \frac{r}{q}F_\Sigma \qquad (3.1)$$

and examine the conditions under which $\ell_\Sigma = \ell_0 = 1 + \frac{r}{q}F$.

Consider a class $\Omega_\rho$ of matrices $\Sigma$ with $\Sigma^{-1} = I - H_\rho$, where $H_\rho$ is some
symmetric matrix over some parameter space $B \subset R^k$, $\rho = (\rho_1, \ldots, \rho_k) \in B$,
with $H_0 = 0$ for $\rho_1 = \cdots = \rho_k = 0$ and $|y'H_\rho y| < y'y$ for all possible
$y$, $\rho \in B$.

We now state the following lemmas which are used in the proof of The- 
orem 1. 



Lemma 1. For all $\Sigma \in \Omega_\rho$, $\ell_\Sigma$ can be represented as follows:

$$\ell_\Sigma = \frac{y'A_0y + y'A_0T^0_\rho A_0y}{y'Ay + y'AT_\rho Ay}, \qquad (3.2)$$

where

$$T_\rho = f(H_\rho, A); \qquad T^0_\rho = f(H_\rho, A_0), \qquad (3.3)$$

with the explicit form of the function $f$ given by (5.5) $= A + AT_\rho A$. Here $A$
and $A_0$ are defined by (2.6).

Lemma 2. Suppose that for some symmetric matrix $T$,

$$(y'A_0TA_0y)(y'Ay) = (y'ATAy)(y'A_0y). \qquad (3.4)$$

Then $A_0TA_0 = \theta A_0$ and $ATA = \theta A$, where $\theta$ is some constant.

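Theorem 1. Suppose that $H_\rho$ is a polynomial or a convergent series
in the parameters $\rho_1, \ldots, \rho_k$ with symmetric matrices as coefficients. If
$\ell_\Sigma = \ell_0$ and $T(r_1, \ldots, r_k)$ is the coefficient of $\rho_1^{r_1}\cdots\rho_k^{r_k}$ in
$\sum_{r=1}^{r_1+\cdots+r_k}H_\rho^r$, it follows that

$$A_0T(r_1, \ldots, r_k)A_0 = \theta(r_1, \ldots, r_k)A_0 \qquad (3.5)$$

for some constant $\theta(r_1, \ldots, r_k)$.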

For proofs of the lemmas and Theorem 1, see Section 5. 

Remark 1. Suppose that $A_0H_\rho^kA_0 = \theta_{k,\rho}A_0$, $k = 1, 2, \ldots$, where $\theta_{k,\rho}$ is a
scalar function of $\rho_1, \ldots, \rho_k$. Then $\ell_\Sigma = \ell_0$.

To prove the above statement one only needs to note that

$$AH_\rho^kA = AA_0H_\rho^kA_0A = A\theta_{k,\rho}A_0A = \theta_{k,\rho}A$$

and to substitute into (3.3) and (3.2).

Theorem 1 and Remark 1 give the necessary and sufficient conditions for
the constancy of $\ell_\Sigma$ and, therefore, for the numerical robustness of F, RS,
LR and W statistics against elliptically normal errors that can be described
by a variance-covariance matrix $\Sigma \in \Omega_\rho$, $\Omega_\rho = \{\Sigma \mid \Sigma^{-1} = I - H_\rho$ with $H_\rho$
being a polynomial or convergent series$\}$.

The stringency of these conditions makes numerical robustness an
exception rather than the rule. No process with non-trivial $H_\rho$ gives rise to
robust statistics for all possible $X$ and $X_0$; therefore, numerical robustness
has applications mainly for experimental design. Also, of course, one can
always check if the observation matrix $X$ just happened to lead to statistics
numerically robust against a particular process in the errors, but, if so,
it would be strictly a matter of luck. We show that our results generalize
those on experimental design with intraclass covariance structure by Ghosh
and Sinha (1980) and examine the possibilities for numerical robustness over
heteroskedastic and ARMA processes.
3.1 Implications for Intraclass Covariance Structures

The result of Ghosh and Sinha (1980, Theorem 3.1) follows as a special
case of our theorem. Indeed, they considered $\Sigma = (1 - \rho)I + \rho 1_n \times 1_n'$,
$-(n-1)^{-1} < \rho < 1$, where $1_n$ is a column vector of ones, and hence $1_n \times 1_n' = nQ$,
where $Q$ is a projector of rank 1 onto the subspace spanned by $1_n$. Here

$$\Sigma^{-1} = \frac{1}{1 - \rho}\left[I - \frac{n\rho}{1 + (n - 1)\rho}Q\right].$$

Direct application of Theorem 1 to $\Sigma^{-1}$ implies that $A_0QA_0 = \theta A_0$. Since
rank $A_0 >$ rank $A \ge 1$, it follows that $\theta = 0$, $A_01_n \times 1_n'A_0 = 0$, and $A_01_n = 0$;
therefore, $1_n$ is the eigenvector of both $P_0$ and $P$ as stated in the result of
Ghosh and Sinha. It is also easy to verify that Theorem 3.2 of the same
paper follows from our results.
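The inverse of the intraclass covariance matrix can be confirmed with exact rational arithmetic. This sketch assumes $\Sigma = (1 - \rho)I + \rho 1_n1_n'$ without a scale factor and $Q = 1_n1_n'/n$; the function names and the values of $n$ and $\rho$ are illustrative.

```python
from fractions import Fraction

def intraclass(n, rho):
    """Sigma = (1 - rho) I + rho 1 1': ones on the diagonal, rho elsewhere."""
    return [[(1 - rho) * (i == j) + rho for j in range(n)] for i in range(n)]

def intraclass_inverse(n, rho):
    """Candidate inverse (1 - rho)^(-1) [I - n*rho / (1 + (n-1)*rho) Q], Q = 1 1'/n."""
    c = rho / (1 + (n - 1) * rho)  # the factor n cancels against the 1/n inside Q
    return [[((i == j) - c) / (1 - rho) for j in range(n)] for i in range(n)]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def identity(n):
    return [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]

if __name__ == "__main__":
    n, rho = 5, Fraction(3, 10)
    print(matmul(intraclass(n, rho), intraclass_inverse(n, rho)) == identity(n))
```

Since $\Sigma^{-1}$ is of the form $I - H_\rho$ up to scale, with $H_\rho$ proportional to $Q$, the condition of Theorem 1 reduces to a condition on $A_0QA_0$ alone.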

3.2 Implications for Heteroskedastic Errors 

Theorem 1 also provides a characterization of the class of heteroskedastic
$\Sigma$ for a given $X$ and restriction $R$ over which $\ell_\Sigma = \ell_0$. It is required that
$\Sigma = \gamma_0I + \Lambda$, where the diagonal matrix $\Lambda$ is such that $\Lambda A_0 = 0$; this implies
that the $\Lambda$ matrix has block-diagonal structure with a block of zeros.

3.3 Implications of Theorem 1 for Autoregressive (AR) Error Structures

The matrix $\Sigma^{-1}$ is known for autoregressive processes of order $k$, AR(k).
If we set all but one of the parameters of AR(k), namely, $\rho_1, \rho_2, \ldots, \rho_k$,
equal to zero, i.e., $\rho_k \neq 0$, $\rho_i = 0$ for $i \neq k$, then $\Sigma^{-1}$ reduces to the matrix
$I + \rho C_{1k} + \rho^2C_{2k}$. Here $C_{1k}$ is the matrix with elements $(C_{1k})_{ij}$ equal to $-1$
if $|i - j| = k$ and 0 otherwise, and $C_{2k}$ is a diagonal matrix with elements
$(C_{2k})_{ij}$ equal to 1 if $k < i = j \le n - k$ and 0 otherwise. We shall denote this
process by AR(k,0). A necessary condition for constancy of $\ell_\Sigma$ for AR(k,0)
is that $A_0C_{1k}A_0 = \theta A_0$, where $A_0$ is a projector of rank no less than 2.
This implies that $C_{1k}$ should have at least two identical eigenvalues, which
is true only if $k \ge \frac{n}{2} + 1$. If $k \ge \frac{n}{2} + 1$ then $C_{1k}$ has a kernel of dimension
$n - 2(n - k) = 2k - n \ge 2$. In this case $\ell_\Sigma$ is constant for all $A_0$ that project
into the intersection of the kernel of the matrix $C_{1k}$ and either the image or
the kernel of $C_{2k}$. Then, of course, $\theta = 0$ and $A_0C_{2k}A_0 = \gamma A_0$ with $\gamma = 0$ or
1. It is not hard to check that for these $k$ and $A_0$ this suffices for constancy
of $\ell_\Sigma$.
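The kernel dimension $2k - n$ of $C_{1k}$ can be checked directly for small cases. The sketch below assumes $(C_{1k})_{ij} = -1$ when $|i - j| = k$ and 0 otherwise, and computes the rank by exact Gaussian elimination; the function names and the chosen $n$, $k$ are illustrative.

```python
from fractions import Fraction

def c1(n, k):
    """C_{1k}: entries -1 where |i - j| = k, 0 otherwise."""
    return [
        [Fraction(-1) if abs(i - j) == k else Fraction(0) for j in range(n)]
        for i in range(n)
    ]

def rank(matrix):
    """Rank by exact Gauss-Jordan elimination over the rationals."""
    m = [row[:] for row in matrix]
    found, rows, cols = 0, len(m), len(m[0])
    for col in range(cols):
        pivot = next((r for r in range(found, rows) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[found], m[pivot] = m[pivot], m[found]
        for r in range(rows):
            if r != found and m[r][col] != 0:
                factor = m[r][col] / m[found][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[found])]
        found += 1
    return found

def kernel_dimension(n, k):
    """Dimension of the null space of C_{1k}."""
    return n - rank(c1(n, k))

if __name__ == "__main__":
    for n, k in [(10, 6), (10, 7), (10, 8)]:
        print(n, k, kernel_dimension(n, k), 2 * k - n)
```

For these high-order cases the rows $n - k + 1, \ldots, k$ of $C_{1k}$ are identically zero, which is where the kernel comes from.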




3.4 Implications for Moving Average Error Structures 

The class of MA(k,0) error structures, where all but the parameter of
order $k$ are zeros, is represented by $\Sigma = \sigma^2[(1 + \omega^2)I + \omega C_{1k}]$, where $C_{1k}$
is the same as for the AR(k) structure. We denote $\sigma^2(1 + \omega^2)$ by $\gamma$ and
$\omega/(1 + \omega^2)$ by $\rho$. Theorem 1 can be directly applied to this class of $\Sigma$ after
noting that $\Sigma^{-1} = \gamma^{-1}(I - \rho C_{1k} + \rho^2C_{1k}^2 - \cdots)$. For this $\Sigma$, $\ell_\Sigma$ is robust
only if $A_0C_{1k}A_0 = \theta A_0$.

Here again if $k \ge \frac{n}{2} + 1$, matrices $A_0$, which yield robust test statistics,
exist. Such an $A_0$ would project onto the kernel of $C_{1k}$.

Thus, we conclude that there are some data structures that produce 
statistics that are robust over AR and MA error processes of sufficiently 
high orders (which do not include lower order components). We also notice 
that the higher the order of the error process the larger the class of data 
matrices that give robust statistics. This is hardly surprising, since in the 
limiting case, processes of order higher than the dimensions of the data will 
not affect the statistics at all. 

We also note that, in general, the larger the number of equal eigenvalues, 
including zeros, of H (or the larger the dimension of any projector in the 
canonical representation of the symmetric matrix H) the more possibilities 
for numerically robust statistics. 

Note that if $\ell_\Sigma = \ell_0$, then $F = F_{\phi,\Sigma}$ and $LR = LR_{\phi,\Sigma}$, but unless the
distribution is elliptical normal $RS \neq RS_{\phi,\Sigma}$ and $W \neq W_{\phi,\Sigma}$.



4. INFERENTIAL ROBUSTNESS AND BOUNDS 
ON CRITICAL VALUES 



If two test statistics are such that one is a monotonic function of the 
other, then any probabilistic statement about one implies a similar statement 
about the other. Thus, if one falls beyond a critical value for some level of the 
test, so does the other. Therefore, as was stated by Ullah and Zinde-Walsh
(1985) (and can be seen immediately from (2.12)), RS and W are inferentially
robust over the class of all spherical monotonic error distributions.

Here we examine the inferential robustness of the test statistics F, LR,
RS and W, calculated under the assumption of spherical normality, for
general elliptic distributions. To emphasize this we denote the statistics by
$F(\Sigma)$, $LR(\Sigma)$, $RS(\Sigma)$ and $W(\Sigma)$. Since the test statistics are inferentially
robust against spherical distributions it will not make any difference to our
conclusions whether the statistics bear the subscript $\phi$ or not.

Consider the variate $S(\Sigma) = F(\Sigma)\frac{r}{q}$, where

$$S(\Sigma) = \frac{u'A_1u}{u'A_2u}, \qquad (4.1)$$

with $A_1 = A_0 - A$, $A_2 = A$ as defined in (2.6). The critical values for
$S(\Sigma)$ depend on the matrix $\Sigma$. Indeed, if one considers the transformation
$u = \Sigma^{1/2}v$, then

$$S(\Sigma) = \frac{v'\Sigma^{1/2}A_1\Sigma^{1/2}v}{v'\Sigma^{1/2}A_2\Sigma^{1/2}v}, \qquad (4.2)$$

where $v$ is spherically symmetric. Denote $S(I)$ by $S$.

We observe that as long as $S(\Sigma)$ is inferentially robust over a class $\Omega$ of
$\Sigma$ matrices all the statistics $F(\Sigma)$, $LR(\Sigma)$, $RS(\Sigma)$ and $W(\Sigma)$ are inferentially
robust over $\Omega$ as well. We assume that $I \in \Omega$.

Denote by $O(n)$ the group of orthogonal $n \times n$ matrices in the Euclidean
space $R^n$. For any $T \in O(n)$ the distribution of $S$ and of $S(\Sigma)$ in (4.2) is
invariant with respect to the transformation $T: v \to Tv$.

Lemma 3. For a positive definite matrix $\Sigma$ and any two mutually
orthogonal idempotent matrices $A_1$, $A_2$, there exists $T \in O(n)$ such that
$\Lambda_i = T'\Sigma^{1/2}A_i\Sigma^{1/2}T$ is a diagonal matrix for $i = 1, 2$.

Proof. See Section 5.

This lemma allows us to rewrite $S(\Sigma)$ by substituting $w = T'v$ as

$$S(\Sigma) = \frac{w'\Lambda_1w}{w'\Lambda_2w}, \qquad (4.3)$$



where we can write 

$$\Lambda_1 = \operatorname{diag}(\mu_1, \ldots, \mu_k, 0, \ldots, 0), \quad \mu_1 \le \cdots \le \mu_k, \quad k = p - m,$$

$$\Lambda_2 = \operatorname{diag}(0, \ldots, 0, \mu_{p+1}, \ldots, \mu_n), \quad \mu_{p+1} \le \cdots \le \mu_n, \qquad (4.4)$$

where diag(. . .) denotes a diagonal matrix with given diagonal elements.

A similar transformation for $S$ yields

$$S = \frac{w'Q_1w}{w'Q_2w}, \qquad (4.5)$$

with $Q_1 = \operatorname{diag}(1, \ldots, 1, 0, \ldots, 0)$, where the first $k$ elements equal 1, and
$Q_2 = \operatorname{diag}(0, \ldots, 0, 1, \ldots, 1)$, where the last $n - p$ elements equal 1. Note
that the transformation of $S$ may be performed with a matrix from $O(n)$
different from $T$, but the distributions of $S(\Sigma)$ and $S$ are not affected by an
orthogonal transformation of the spherical variable.

Clearly the following inequality holds:

$$\frac{\mu_1}{\mu_n}S \le S(\Sigma) \le \frac{\mu_k}{\mu_{p+1}}S. \qquad (4.6)$$

It follows from (4.3) that all the values for $S(\Sigma)$ within the bounds given by
(4.6) are realized for some $w$. Therefore, a sufficient condition for inferential
robustness is that $\mu_1/\mu_n = \mu_k/\mu_{p+1}$.

However, this type of condition is hardly less restrictive than those
demanded for numerical robustness.

We thus seek bounds on the critical values of the statistics $F(\Sigma)$, $LR(\Sigma)$,
$RS(\Sigma)$, $W(\Sigma)$ which will assure the test conclusions over some class $\Omega$ as long
as the respective values calculated according to (2.4) and (2.5) are outside
these bounds.

Since $A_1$, $A_2$ are projectors with eigenvalues equal to 0 or 1, the nonzero
eigenvalues of $\Sigma^{1/2}A_i\Sigma^{1/2}$ are bounded by the eigenvalues of $\Sigma$. Denote by $\lambda_{\max}$ the
highest and $\lambda_{\min}$ the lowest eigenvalue of $\Sigma$. Also denote by $\delta$ the ratio
$\lambda_{\max}/\lambda_{\min}$. Clearly

$$\lambda_{\min}w'Q_iw \le w'\Lambda_iw \le \lambda_{\max}w'Q_iw. \qquad (4.7)$$

Therefore

$$\delta^{-1}S \le S(\Sigma) \le \delta S. \qquad (4.8)$$

This inequality holds irrespective of $A_1$, $A_2$ and the particular $\Sigma$, and only
reflects one characteristic of $\Sigma$: the ratio of the highest to lowest eigenvalues.
The bigger $\delta$ is the more $\Sigma$ is distinguished from $I$, for which $\delta = 1$.

If for any two statistics $S_1$ and $S_2$, the inequality $S_1 \le S_2$ holds
everywhere, then their cumulative distributions $G_i(x) = \mathrm{Prob}(S_i \le x)$, $i = 1, 2$,
are related as follows:

$$G_1(x) \ge G_2(x),$$

and, therefore, for some level of the test the critical values satisfy

$$S_1^{cr} \le S_2^{cr}.$$

From this observation and (4.8) we obtain the following theorem.

Theorem 2. The critical values $F_{cr}(\Sigma)$ are located within the following
intervals dependent on $\delta$, the ratio of highest to lowest eigenvalues of $\Sigma$:

$$\delta^{-1}F_{cr} \le F_{cr}(\Sigma) \le \delta F_{cr}. \qquad (4.9)$$

Corollary. The following inequalities hold:

$$LR_{cr} + n\log\left(\delta^{-1} + (1 - \delta^{-1})/\ell_{cr}\right) \le LR_{cr}(\Sigma) \le LR_{cr} + n\log\left(\delta - (\delta - 1)/\ell_{cr}\right), \qquad \ell_{cr} = \exp(LR_{cr}/n);$$

$$\frac{\delta^{-1}nRS_{cr}}{n - (1 - \delta^{-1})RS_{cr}} \le RS_{cr}(\Sigma) \le \frac{\delta nRS_{cr}}{n + (\delta - 1)RS_{cr}};$$

$$\delta^{-1}W_{cr} \le W_{cr}(\Sigma) \le \delta W_{cr}. \qquad (4.10)$$
The inequalities given in (4.10) are derived easily from (4.8) and (2.5).
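The bounds of Theorem 2 and its Corollary are simple enough to tabulate in a few lines. The sketch below assumes the relations $S = Fr/q$, $LR = n\log(1 + S)$, $RS = nS/(1 + S)$ and $W = nS$ from (2.5); the function name, the dictionary layout and the numerical inputs (including the critical value 3.2) are hypothetical.

```python
import math

def critical_value_bounds(n, delta, f_cr, r, q):
    """Lower and upper bounds on the critical values of F, LR, RS and W for a
    covariance matrix whose extreme eigenvalue ratio is delta."""
    s_cr = f_cr * r / q                  # S = F r/q
    lr_cr = n * math.log(1.0 + s_cr)     # (2.5)
    rs_cr = n * s_cr / (1.0 + s_cr)
    w_cr = n * s_cr
    ell_cr = math.exp(lr_cr / n)
    d_inv = 1.0 / delta
    return {
        "F": (d_inv * f_cr, delta * f_cr),
        "LR": (
            lr_cr + n * math.log(d_inv + (1.0 - d_inv) / ell_cr),
            lr_cr + n * math.log(delta - (delta - 1.0) / ell_cr),
        ),
        "RS": (
            d_inv * n * rs_cr / (n - (1.0 - d_inv) * rs_cr),
            delta * n * rs_cr / (n + (delta - 1.0) * rs_cr),
        ),
        "W": (d_inv * w_cr, delta * w_cr),
    }

if __name__ == "__main__":
    # Hypothetical inputs: n = 50, p = 5, r = 2, so q = 45; f_cr is illustrative.
    for name, (low, high) in critical_value_bounds(50, 2.0, 3.2, 2, 45).items():
        print(name, round(low, 4), round(high, 4))
```

The LR and RS bounds returned here coincide with applying $LR = n\log(1 + \cdot)$ and $RS = n(\cdot)/(1 + \cdot)$ to $\delta^{-1}S_{cr}$ and $\delta S_{cr}$, which is how (4.10) follows from (4.8).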

4.1 Discussion of the Results 

The relationship given by (4.9) has the following immediate interpretations
for the F-test. Firstly, if a class $\Omega$ of $\Sigma$ matrices is such that the
biggest value of $\delta$ is limited by some $\delta^*$, then the test conclusions are the
same for any $\Sigma$ as long as either $F/F_{cr} > \delta^*$ or $F_{cr}/F > \delta^*$, where $F$ and
$F_{cr}$ are, respectively, the value of the test statistic according to (2.4) and the
critical value for the hypothesis test under the spherical normal. Secondly,
if $F > F_{cr}$ ($F < F_{cr}$), then the test conclusions are robust over the class $\Omega$
of $\Sigma$ matrices with $\delta$, the ratio of maximum to minimum eigenvalues, such
that $\delta(\Sigma) < F/F_{cr}$ ($\delta(\Sigma) < F_{cr}/F$).

Since the relationship for W in (4.10) is similar to (4.9) for F, the same
conclusions apply. A simple examination of (4.10) shows that the bounds on
the critical values for $RS(\Sigma)$ are inside the interval $[\delta^{-1}RS_{cr}, \delta RS_{cr}]$; thus,
the conclusions made above hold for RS as well.

The following example demonstrates how our bounds compare to those
obtained by Vinod (1976) and Vinod and Ullah (1981) for the t statistic
under an AR(1) process. Suppose that $\rho = .5$. Then the eigenvalues of the
variance-covariance matrix are contained between the asymptotic ($n \to \infty$)
maximum and minimum eigenvalues, $1 + \rho^2 + 2\rho = 2.25$ and $1 + \rho^2 - 2\rho = .25$.
Thus, the bounds on the critical value of the t statistic can be calculated
based on the square root of the ratio, $\sqrt{2.25/.25} = 3$. The critical values given
by Vinod and Ullah are tabulated according to the number of restrictions
$p$ and sample size $n$. If $n = 50$, $p = 5$, for instance, their Table 4.1 gives
1.14 and 3.93 as the lower and upper bounds, respectively, at the 5% level,
whereas our calculation, which involves only dividing and multiplying the
standard critical value by 3, gives .671 and 6.042 as the lower and upper
bounds, respectively.
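The arithmetic of this example can be reproduced as follows. The standard critical value 2.014 used below is not stated in the text; it is inferred from the reported upper bound $6.042 = 3 \times 2.014$ and should be read as an assumption. Function names are illustrative.

```python
import math

def eigenvalue_ratio_ar1(rho):
    """Ratio of the asymptotic extreme eigenvalues 1 + rho^2 + 2 rho and 1 + rho^2 - 2 rho."""
    return (1.0 + rho * rho + 2.0 * rho) / (1.0 + rho * rho - 2.0 * rho)

def t_bounds(t_cr, delta):
    """Divide and multiply the standard t critical value by sqrt(delta)."""
    root = math.sqrt(delta)
    return t_cr / root, t_cr * root

if __name__ == "__main__":
    delta = eigenvalue_ratio_ar1(0.5)
    t_cr = 2.014  # assumed: backed out from the bounds .671 and 6.042 quoted above
    lower, upper = t_bounds(t_cr, delta)
    print(delta, round(lower, 3), round(upper, 3))
```

Only the two extreme eigenvalues enter the calculation, which is the computational simplicity the text emphasizes.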

However, there are three ways in which our results are an improvement.
Firstly, they relate to any $\Sigma$ matrices, not just those generated by an AR(1)
or MA(1) process. Secondly, they require the calculation of the maximum
and minimum eigenvalues of $\Sigma$ only, whereas Vinod and Ullah utilized all the
eigenvalues in a much more complicated calculation. Thirdly, our bounds
are independent of $A$ and $A_0$ matrices.

Note that if the bounds on the positive eigenvalues of $A\Sigma A$ and
$(A_0 - A)\Sigma(A_0 - A)$ can be established they will provide more accurate intervals
for critical values as can be seen from (4.6) and the fact that

$$\frac{\mu_1}{\mu_n} \ge \delta^{-1} \quad \text{and} \quad \frac{\mu_k}{\mu_{p+1}} \le \delta.$$

Recall that for the multivariate normal error distribution, Evans and
Savin (1982) have derived for $|W/n| < 1$ the relationship

$$W - LR \simeq LR - RS \simeq W^2/(2n), \qquad (4.11)$$

which generalizes the known inequality:

$$W \ge LR \ge RS. \qquad (4.12)$$

Ullah and Zinde-Walsh (1984) have shown that a more complex relationship
exists between the statistics $LR_\phi$, $RS_\phi$ and $W_\phi$ when the distribution
is spherical but non-normal. Here, once again, straightforward inequalities
relating the bounds on the statistics can be derived.

For any of the statistics F, LR, RS or W, denote the upper and lower
bounds given by (4.9) and (4.10) by upper or lower bars. Next define $F_U$
and $F_L$ as follows:

$$F_U = (\bar F - \underline F)/\bar F \quad \text{and} \quad F_L = (\bar F - \underline F)/\underline F. \qquad (4.13)$$

In a similar notation, define $LR_U$ and $LR_L$, $RS_U$ and $RS_L$, and $W_U$ and
$W_L$. These ratios show the length of the interval between the bounds in
relation to its upper and lower point, respectively. Thus, they measure the
“tightness” of the bounds on the critical values of the statistics, and the
following theorem establishes a ranking of the statistics with respect to this
characteristic.

Theorem 3. The following relationships hold:

$$F_U = W_U > LR_U > RS_U \quad \text{and} \quad F_L = W_L > LR_L > RS_L. \qquad (4.14)$$

Proof. See Section 5.

This theorem demonstrates the relative robustness of the bounds on 
critical values for the different statistics. The bounds are tightest for RS 
and are worst when the F-test or the W-test is used. 
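The ranking in Theorem 3 can be illustrated numerically. The sketch below assumes that the upper and lower bounds are the functions of $\delta$ and $S = S_{cr}$ listed in (5.14), with the common factors $n$ and $q/r$ cancelling in the ratios of (4.13); the function name and the grid of test values are illustrative.

```python
import math

def tightness(delta, s):
    """Return {name: (U, L)}, the relative interval lengths of (4.13), for F, W, LR, RS."""
    upper = {
        "F": delta * s,
        "W": delta * s,
        "LR": math.log(1.0 + delta * s),
        "RS": delta * s / (1.0 + delta * s),
    }
    lower = {
        "F": s / delta,
        "W": s / delta,
        "LR": math.log(1.0 + s / delta),
        "RS": (s / delta) / (1.0 + s / delta),
    }
    return {
        name: ((upper[name] - lower[name]) / upper[name],
               (upper[name] - lower[name]) / lower[name])
        for name in upper
    }

if __name__ == "__main__":
    for name, (u, low) in tightness(2.0, 0.5).items():
        print(name, round(u, 4), round(low, 4))
```

For $\delta = 2$ and $S = 0.5$ the upper-relative measures are ordered as $F_U = W_U > LR_U > RS_U$, in line with (4.14).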



5. PROOFS OF THE LEMMAS AND THEOREMS 



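Proof of Lemma 1. From (2.9) and (3.1) we can write

$$\ell_\Sigma = \frac{(y - X_0\hat\beta_{0\Sigma})'\Sigma^{-1}(y - X_0\hat\beta_{0\Sigma})}{(y - X\hat\beta_\Sigma)'\Sigma^{-1}(y - X\hat\beta_\Sigma)}. \qquad (5.1)$$

We transform the denominator of (5.1) by substituting

$$\hat\beta_\Sigma = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}y$$

and obtain

$$y'\Sigma^{-1}\left[I - X(X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}\right]y. \qquad (5.2)$$

Consider

$$\Sigma^{-1}\left[I - X(X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}\right], \qquad (5.3)$$

where $\Sigma^{-1} = (I - H)$ with $|y'Hy| < y'y$.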

We can expand part of (5.3) in a geometric series as follows:

$$
\begin{aligned}
X[X'(I - H)X]^{-1}X' &= X(X'X)^{-1/2}\left[I - (X'X)^{-1/2}X'HX(X'X)^{-1/2}\right]^{-1}(X'X)^{-1/2}X' \\
&= P + PHP + PHPHP + \cdots + P(HP)^k + \cdots,
\end{aligned}
$$

where $P$ is defined by (2.6). If one substitutes into (5.3) one obtains

$$
\begin{aligned}
& I - H - P - PHP - \cdots - P(HP)^k - \cdots \\
& \quad + HP + HPHP + \cdots + (HP)^k + \cdots \\
& \quad - HPH - HPHPH - \cdots - H(PH)^k - \cdots \\
& \quad + PH + PHPH + \cdots + (PH)^k + \cdots \\
& = A - AHA - AHPHA - \cdots - AH(PH)^kA - \cdots
\end{aligned}
$$

Indeed the last term is obtained in the following way:

$$-P(HP)^kHP + (HP)^kHP - H(PH)^k + P(HP)^kH = -AH(PH)^kA,$$

where we have substituted $P = I - A$. We therefore show that (5.3) can be
represented as follows:

$$A - AHA - AHPHA - \cdots - AH(PH)^kA - \cdots \qquad (5.4)$$

Further, we can replace $P$ by $I - A$ everywhere in (5.4) to obtain

$$A - AHA - AH^2A + AHAHA + \cdots + (-1)^{\ell}AH^{i_1}AH^{i_2}A\cdots H^{i_\ell}A + \cdots \qquad (5.5)$$

This formula can be easily verified by substitution of $P = I - A$ into (5.4).
The numerator of $\ell_\Sigma$ can be transformed in an analogous manner. This
concludes the proof of Lemma 1.
Proof of Lemma 2. For a given vector $y$ we define mutually orthogonal
unit vectors $y_0$, $y_1$, and $y_2$, such that

$$y_i'y_i = 1, \; i = 0, 1, 2; \qquad y_i'y_j = 0 \text{ for all } 0 \le i < j \le 2;$$

$$Ay = \alpha y_1; \qquad (A_0 - A)y = \beta y_0; \qquad \text{and} \qquad P_0y = \gamma y_2. \qquad (5.6)$$

If one substitutes into (3.4) one obtains:

$$\alpha^2\left(\alpha^2y_1'Ty_1 + \alpha\beta y_1'Ty_0 + \alpha\beta y_0'Ty_1 + \beta^2y_0'Ty_0\right) = (\alpha^2 + \beta^2)(\alpha^2y_1'Ty_1). \qquad (5.7)$$

We equate the coefficients of all the monomials in $\alpha$ and $\beta$ in (5.7) to
obtain:

$$y_1'Ty_1 = y_0'Ty_0 \qquad (5.8)$$

and

$$y_1'Ty_0 + y_0'Ty_1 = 0. \qquad (5.9)$$

Since $T$ is symmetric, (5.9) implies that

$$y_1'Ty_0 = y_0'Ty_1 = 0. \qquad (5.10)$$

Conditions (5.8) and (5.10) hold for any $y$. We can denote $y_1'Ty_1$ by $\theta$,
where $\theta$ is a constant. For any $y$,

$$y_1'Ty_1 = y'ATAy/\alpha^2 = \theta,$$

where $\alpha^2 = y'Ay$ and we have $y'ATAy = \theta y'Ay$.

Similarly, using (5.10) in addition to (5.8), we can show that

$$y'A_0TA_0y = \theta y'A_0y.$$

Proof of Theorem 1. Consider the expression (3.2) for $\ell_\Sigma$. We can write
it as follows:

$$\ell_\Sigma = \ell_0\,\frac{1 + y'A_0T^0_\rho A_0y/y'A_0y}{1 + y'AT_\rho Ay/y'Ay}. \qquad (5.11)$$

If $\ell_\Sigma = \ell_0$ it follows that

$$(y'A_0T^0_\rho A_0y)\,y'Ay = (y'AT_\rho Ay)\,y'A_0y. \qquad (5.12)$$

The expressions on each side of (5.12) are series in the parameters $\rho_1, \ldots, \rho_k$
of $H_\rho$. If the two series of the right and left sides in (5.12) are to be identical,
all the coefficients of monomials have to coincide. Consider the coefficient
of $\rho_1$ in $T_\rho$; it is some symmetric matrix $T_1$ (it is also the coefficient of $\rho_1$ in
$H_\rho$).

We get for $T_1$

$$(y'A_0T_1A_0y)\,y'Ay = (y'AT_1Ay)\,y'A_0y;$$

thus, by Lemma 2,

$$A_0T_1A_0 = \theta_1A_0 \quad (\text{and } AT_1A = \theta_1A).$$

Similar equalities hold for all coefficients of $\rho_2, \ldots, \rho_k$ in $T_\rho$.

Any coefficient of a monomial $\rho_1^{r_1}\cdots\rho_k^{r_k}$ in $T^0_\rho$ can be represented as the
sum of such a coefficient in

$$\sum_{r=1}^{r_1+\cdots+r_k}H_\rho^r,$$

denoted by $T(r_1, \ldots, r_k)$, and products of coefficients of lower power
monomials with $A_0$ in between. We can now use induction to show that

$$A_0T(r_1, \ldots, r_k)A_0 = \theta(r_1, \ldots, r_k)A_0. \qquad (5.13)$$

If (5.13) holds for all coefficients of monomials of lower power, we can replace
such $A_0TA_0$ by the appropriate $\theta A_0$ and will arrive at (3.4).

This concludes the proof of Theorem 1.

Proof of Lemma 3. Let $T_i \in O(n)$ be a matrix that diagonalizes $\Sigma^{1/2}A_i\Sigma^{1/2}$;
that is, $T_i'\Sigma^{1/2}A_i\Sigma^{1/2}T_i = \Lambda_i$, where $\Lambda_i$ is diagonal. Denote by $Q_i$ the
orthogonal projector onto the space of non-zero eigenvectors of $\Sigma^{1/2}A_i\Sigma^{1/2}$.

We show that $Q_1$ and $Q_2$ are mutually orthogonal; thus each $T_i$ can be
represented by the same matrix $M_1 + M_2 + M_3$ with $M_i$ mutually orthogonal,
with the columns of $M_1$ ($M_2$) formed by the orthonormal system of non-zero
eigenvectors of $\Sigma^{1/2}A_1\Sigma^{1/2}$ ($\Sigma^{1/2}A_2\Sigma^{1/2}$). Suppose for some vector $\xi$,
$\Sigma^{1/2}A_i\Sigma^{1/2}\xi = \lambda\xi$ with $\lambda \neq 0$. Then $Q_i\xi = \xi$. We have
$A_i\Sigma^{1/2}\xi = \lambda\Sigma^{-1/2}\xi = A_i(A_i\Sigma^{1/2}\xi) = \lambda A_i\Sigma^{-1/2}\xi$. Therefore
$A_i\Sigma^{-1/2}\xi = \Sigma^{-1/2}\xi$. Clearly then for any $\eta$ such that
$\eta = Q_i\eta$ we have $A_i\Sigma^{-1/2}\eta = \Sigma^{-1/2}\eta$. For a vector $\eta$ for which
$Q_1\eta = Q_2\eta = \eta$ one would have

$$\eta = \Sigma^{1/2}\left(A_2\Sigma^{-1/2}\eta\right) = \Sigma^{1/2}A_2\left(A_1\Sigma^{-1/2}\eta\right) = \Sigma^{1/2}A_2A_1\Sigma^{-1/2}\eta = 0,$$

since $A_2A_1 = 0$. Therefore, $Q_1$ and $Q_2$ are orthogonal projectors.
This concludes the proof.



Proof of Theorem 3. We represent all the bounds as functions of $S = S_{cr}$,
by combining (2.11), (4.9) and (4.10):

$$\bar F = (q/r)\delta S; \quad \bar W = n\delta S; \quad \overline{LR} = n\log(1 + \delta S); \quad \overline{RS} = n\delta S/(1 + \delta S);$$

$$\underline F = (q/r)\delta^{-1}S; \quad \underline W = n\delta^{-1}S; \quad \underline{LR} = n\log(1 + \delta^{-1}S); \quad \underline{RS} = n\delta^{-1}S/(1 + \delta^{-1}S). \qquad (5.14)$$

Next we derive directly that

$$F_U = W_U = 1 - \delta^{-2}; \qquad F_L = W_L = \delta^2 - 1; \qquad (5.15)$$

$$W_U - LR_U = \ln(1 + \delta^{-1}S)/\ln(1 + \delta S) - \delta^{-2}; \qquad (5.16)$$

$$W_L - LR_L = \delta^2 - \ln(1 + \delta S)/\ln(1 + \delta^{-1}S); \qquad (5.17)$$

$$LR_U - RS_U = \delta^{-2}(1 + \delta S)/(1 + \delta^{-1}S) - \ln(1 + \delta^{-1}S)/\ln(1 + \delta S); \qquad (5.18)$$

$$LR_L - RS_L = \ln(1 + \delta S)/\ln(1 + \delta^{-1}S) - \delta^2(1 + \delta^{-1}S)/(1 + \delta S). \qquad (5.19)$$

It immediately follows from (5.15) that whatever conclusions will be
proved to hold here with respect to W will apply to F as well.

Examine (5.16). The expression $\ln(1 + \delta^{-1}S) - \delta^{-2}\ln(1 + \delta S)$ is always
non-negative since it equals zero for $S = 0$ and its derivative with respect to
$S$ is

$$\frac{\delta^{-1}(\delta S - \delta^{-1}S)}{(1 + \delta^{-1}S)(1 + \delta S)}$$

and is thus positive. This proves that (5.16) is positive for positive $S$. Similarly,
we show that $\delta^2\ln(1 + \delta^{-1}S) - \ln(1 + \delta S)$ is positive for $S > 0$; thus,
(5.17) is positive.

Next, consider the expression

$$\delta^{-2}(1 + \delta S)\ln(1 + \delta S) - (1 + \delta^{-1}S)\ln(1 + \delta^{-1}S)$$

related to (5.18). It is zero for $S = 0$; its derivative is equal to

$$\delta^{-1}\ln\frac{1 + \delta S}{1 + \delta^{-1}S}$$

and is positive. Therefore, (5.18) is positive. Similarly, since

$$(1 + \delta S)\ln(1 + \delta S) - \delta^2(1 + \delta^{-1}S)\ln(1 + \delta^{-1}S)$$

is positive (identical proof), it follows that (5.19) is positive.

This concludes the proof of Theorem 3.
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NONPARAMETRIC INFERENCE IN ECONOMETRICS: 
NEW APPLICATIONS 

ABSTRACT 

In this paper, nonparametric estimators of a multivariate density, its 
conditional mean (regression function) and its conditional variances (het- 
eroskedasticity) are presented. Among other results, we establish central 
limit theorems for the estimators and build up confidence intervals based on 
these estimators. These techniques are applied to obtain new results in two 
areas of econometrics: Monte Carlo investigations of the exact distributions 
of test statistics, and the treatment of heteroskedasticity in linear regression. 



1. INTRODUCTION 

In econometrics and in many other scientific disciplines (such as med- 
ical sciences, sociology and psychology) one often has to deal with several 
variables simultaneously, each in some sense dependent on the others. A 
common inference problem in such sciences, especially in econometrics, is 
to see how a particular variable on the average is dependent on others, so 
that prediction (estimation) of the value (or average values) of the variable in 
question can be made at any specified values of the other variables. A second 
common inference problem in such sciences, though somewhat related to the 
first one, is to see how the chosen variable varies (over various spots, items, 
individuals as the case may be) when other variables are held fixed at certain
specified values of interest. The first problem, known as the regression
problem, and the second problem, known as the heteroskedasticity problem 
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in regression, are invariably handled in various sciences by postulating a cer- 
tain fixed model (functional form) for the regression and by assuming fixed 
conditional variance (homoskedasticity) of the variable in question. How- 
ever, it is now well known that the set of all suitable functional forms of the 
regression (or of the distributions of disturbances) is quite often large, and 
any postulations regarding the form of the regression and the value of the 
conditional variance (the variance of the disturbances in the regression) are 
questionable, and their violations have varying effects on the econometric 
inferences and policy implications. 

The only way of avoiding the misspecification of the functional form of 
the regression model or of the conditional variance is, in fact, to assume no 
specific parametric functional form of the regression or of the conditional 
variance; and to estimate the conditional mean and the conditional variance 
completely nonparametrically. This in turn can be achieved by estimating 
nonparametrically the joint probability density function (p.d.f.) of all the 
variables involved. For example, we can estimate the conditional mean and
variance of a variable $x_1$ given $p - 1$ other related variables $(x_2, \ldots, x_p)$, if we
can estimate the joint p.d.f. of $(x_1, \ldots, x_p)$ and then the conditional p.d.f.
of $x_1$ given $(x_2, \ldots, x_p)$.

Nadaraya (1964), Watson (1964), Rosenblatt (1969), Noda (1976) and 
Collomb (1979, 1981) are among the first to consider estimation of a regres- 
sion function nonparametrically using Rosenblatt (1956), Parzen (1962) and 
Cacoullos (1966)-type kernel estimates of a density function. 

In this paper we present nonparametric estimates of a multivariate den- 
sity, and of the conditional mean and the conditional variance of a variable 
given the others. These techniques are then applied to produce new results 
in two areas of econometrics. The integration of density estimation with the 
Monte Carlo technique of doing finite sample econometrics is explored. Also, 
the nonparametric estimate of the variance is used to analyze the problem 
of heteroskedasticity in linear regression. 

The plan of the paper is as follows. In Section 2 we present the method 
of constructing nonparametric estimates of a multivariate p.d.f. and the 
conditional mean and the conditional variance of a variable given the others. 
In Section 3 we state various results with regard to the consistency, the 
variance and the distribution of each of these estimators. The confidence 
intervals for the joint density, the conditional mean and the conditional 
variance are also presented. In Section 4 we give proofs of the main results 
stated in Section 3. Finally, in Section 5 we illustrate the performance of 
our estimators through applications to certain econometric problems. 
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2. ESTIMATORS OF THE JOINT P.D.F., THE CONDITIONAL 
MEAN AND THE CONDITIONAL VARIANCE 

Suppose we have $n$ independent observations $x_t = (x_{t1}, \ldots, x_{tp})$, $t =
1, \ldots, n$, on $p$ random variables $x_1, \ldots, x_p$ of interest. We wish to draw
an inference about the conditional mean of a variable, say $x_1$, given the
remaining variables $x_2, \ldots, x_p$. As mentioned in Section 1, we achieve this
by first considering the nonparametric estimation of the joint p.d.f., say $f$,
of $x = (x_1, \ldots, x_p)$, and then considering the estimation of the conditional
density, say $g$, of $x_1$ given $x' = (x_2, \ldots, x_p)$.

Throughout the remainder of this paper, we denote $\int w_1^j f(w_1, w')\,dw_1$
by $\ell_j(w')$, where $w' = (w_2, \ldots, w_p)$ is a point in the $(p-1)$-dimensional
Euclidean space $R^{p-1}$. We shall, however, use only $\ell_0$, $\ell_1$, $\ell_2$ and $\ell_4$. Notice
that $\ell_0$ is the marginal p.d.f. of $x' = (x_2, \ldots, x_p)$; $g(w_1 \mid w') = f(w)/\ell_0(w')$
is the conditional p.d.f. of $x_1$ at $w_1$ given $x' = w'$;

$$M(w') = \ell_1(w')/\ell_0(w')$$

is the conditional mean of $x_1$ given $x' = w'$; and

$$V(w') = (\ell_2(w')/\ell_0(w')) - M^2(w')$$

is the conditional variance of $x_1$ given $x' = w'$.

As in Singh (1981), for an integer $s > 1$ and for $i = 1, \ldots, p$, let $\mathcal{K}_s$ be
the class of all Borel-measurable, real valued, bounded functions on the real
line, symmetric about zero, such that for a $K_i \in \mathcal{K}_s$,

$$\int y^j K_i(y)\,dy = \begin{cases} 1 & \text{if } j = 0, \\ 0 & \text{if } j = 1, \ldots, s - 1, \end{cases} \tag{2.1}$$

$\int |y^s K_i(y)|\,dy < \infty$ and $|y K_i(y)| \to 0$ as $|y| \to \infty$. For example, for $s = 2$,
take $K_i(y) = \tfrac{1}{2} I(-1 < y < 1)$ or $(2\pi)^{-1/2}\exp(-y^2/2)$ for all $i$; and for $s = 3$
or 4 take $K_i(y) = (2\pi)^{-1/2}[2\exp(-y^2/2) - 2^{-1/2}\exp(-y^2/4)]\,I(-\infty < y < \infty)$
or $(2\pi)^{-1/2}(\tfrac{1}{2})(3 - y^2)\exp(-y^2/2)\,I(-\infty < y < \infty)$ for all $i$. Other examples
of functions $K_i$ are given by Singh (1979, 1981). Define $K$ on $R^p$ by

$$K(y_1, \ldots, y_p) = K_1(y_1) K_2(y_2) \cdots K_p(y_p) = \prod_{i=1}^{p} K_i(y_i).$$
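The moment conditions in (2.1) can be checked numerically for the example kernels. The following Python sketch is not part of the original paper; the function names and the midpoint-rule integrator are illustrative assumptions. It approximates $\int y^j K_i(y)\,dy$ for the Gaussian kernel ($s = 2$) and for the two higher-order kernels quoted above.

```python
import math

def k2_gauss(y):
    """Second-order kernel: the standard normal density."""
    return math.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)

def k4_mix(y):
    """Higher-order kernel (2 pi)^(-1/2) [2 exp(-y^2/2) - 2^(-1/2) exp(-y^2/4)]."""
    c = 1.0 / math.sqrt(2.0 * math.pi)
    return c * (2.0 * math.exp(-0.5 * y * y)
                - math.exp(-0.25 * y * y) / math.sqrt(2.0))

def k4_poly(y):
    """Higher-order kernel (2 pi)^(-1/2) (1/2) (3 - y^2) exp(-y^2/2)."""
    c = 1.0 / math.sqrt(2.0 * math.pi)
    return c * 0.5 * (3.0 - y * y) * math.exp(-0.5 * y * y)

def moment(kernel, j, lo=-12.0, hi=12.0, steps=24000):
    """Midpoint-rule approximation of the integral of y^j * kernel(y) over [lo, hi]."""
    dy = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        y = lo + (i + 0.5) * dy
        total += (y ** j) * kernel(y)
    return total * dy

for name, kernel in (("gauss", k2_gauss), ("mix", k4_mix), ("poly", k4_poly)):
    print(name, [round(moment(kernel, j), 6) for j in range(5)])
```

For the Gaussian kernel the second moment is non-zero, so it satisfies (2.1) with $s = 2$ only; for the other two kernels the first, second and third moments vanish, which is what (2.1) requires for $s = 3$ or 4.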

Remark 2.1. We have chosen the above kernel only for the sake of simplicity. 
All the results of this paper remain valid if K in the various results is replaced
by a more general $K$, namely a Borel-measurable, real valued, bounded
function on $R^p$, symmetric about the origin, such that:

$\int y_1^{j_1} \cdots y_p^{j_p} K(y_1, \ldots, y_p)\,dy = 1$ or $0$ according to whether
$j_1 = \cdots = j_p = 0$ or $0 < j_1 + \cdots + j_p \le s - 1$;

$\int |y_1^{j_1} \cdots y_p^{j_p} K(y_1, \ldots, y_p)|\,dy < \infty$ if $j_1 + \cdots + j_p = s$;

and $\|y\|\,|K(y)| \to 0$ as $\|y\| \to \infty$, where $\|y\|$ is the usual
Euclidean norm on $R^p$.



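For $i = 1, \ldots, p$, let $0 < h_i = h_i(n)$ be functions of the sample size $n$
such that $h_i \to 0$ as $n \to \infty$. (A suitable choice of $h_i$'s will be given later.)
We estimate the joint p.d.f. of $x = (x_1, \ldots, x_p)$ at $w = (w_1, \ldots, w_p)$ by

$$\hat f(w) = \Big(n \prod_{i=1}^{p} h_i\Big)^{-1} \sum_{t=1}^{n} K\Big(\frac{x_t - w}{h}\Big) = \Big(n \prod_{i=1}^{p} h_i\Big)^{-1} \sum_{t=1}^{n} \prod_{i=1}^{p} K_i\Big(\frac{x_{ti} - w_i}{h_i}\Big). \tag{2.2}$$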




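Singh (1981) and Singh and Ullah (1984) have considered the estimator (2.2)
with $h_1 = \cdots = h_p = h$. For other estimators, we define, for $j = 0, 1, \ldots$,

$$\hat\ell_j(w') = \Big(n \prod_{i=2}^{p} h_i\Big)^{-1} \sum_{t=1}^{n} x_{t1}^{j}\, K'\Big(\frac{x_t' - w'}{h'}\Big) = \Big(n \prod_{i=2}^{p} h_i\Big)^{-1} \sum_{t=1}^{n} (x_{t1}^{j}) \Big\{\prod_{i=2}^{p} K_i\Big(\frac{x_{ti} - w_i}{h_i}\Big)\Big\},$$

where $x_t' = (x_{t2}, \ldots, x_{tp})$, $w' = (w_2, \ldots, w_p)$, and

$$K'\Big(\frac{x_t' - w'}{h'}\Big) = \prod_{i=2}^{p} K_i\Big(\frac{x_{ti} - w_i}{h_i}\Big).$$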




A nonparametric estimate of the marginal p.d.f. of $x' = (x_2, \ldots, x_p)$ evaluated
at $w'$ is therefore $\hat\ell_0(w')$. The estimate of the conditional density of $x_1$ at $w_1$
given $x' = w'$ is

$$\hat g(w_1 \mid w') = \frac{\hat f(w)}{\hat\ell_0(w')}.$$

With the estimate of the conditional p.d.f. of $x_1$ given $x'$ in view, a nonparametric
estimate of the conditional mean $M(w')$ of $x_1$ given $x' = w'$ is
therefore

$$\hat M(w') = \int w_1 \hat g(w_1 \mid w')\,dw_1 = \frac{\hat\ell_1(w')}{\hat\ell_0(w')}. \tag{2.3}$$

Finally, our estimator of the conditional variance $V(w')$ of $x_1$ given $x' = w'$
is

$$\hat V(w') = \frac{\hat\ell_2(w')}{\hat\ell_0(w')} - \hat M^2(w'). \tag{2.4}$$

We will show in the next section that the statistics $\hat f$, $\hat M$ and $\hat V$ are consistent
estimators for $f$, $M$ and $V$ respectively.
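The estimators of this section are simple to compute. The sketch below is an illustration rather than the authors' own program: it assumes a Gaussian kernel for every coordinate, and the function names and the simulated data are hypothetical. It evaluates $\hat\ell_j(w')$, the conditional mean estimate (2.3) and the conditional variance estimate (2.4) at one point $w'$.

```python
import math
import random

def gauss_kernel(y):
    """Standard normal density, a valid K_i for s = 2."""
    return math.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)

def ell_hat(j, data, w_rest, h_rest, kernel=gauss_kernel):
    """(n * prod h_i)^(-1) * sum_t x_t1^j * prod_{i>=2} K((x_ti - w_i) / h_i)."""
    total = 0.0
    for row in data:
        weight = 1.0
        for x, w, h in zip(row[1:], w_rest, h_rest):
            weight *= kernel((x - w) / h)
        total += (row[0] ** j) * weight
    return total / (len(data) * math.prod(h_rest))

def cond_mean(data, w_rest, h_rest):
    """Conditional mean estimate: ell_1-hat / ell_0-hat, as in (2.3)."""
    return ell_hat(1, data, w_rest, h_rest) / ell_hat(0, data, w_rest, h_rest)

def cond_var(data, w_rest, h_rest):
    """Conditional variance estimate: ell_2-hat / ell_0-hat - M-hat^2, as in (2.4)."""
    l0 = ell_hat(0, data, w_rest, h_rest)
    m = ell_hat(1, data, w_rest, h_rest) / l0
    return ell_hat(2, data, w_rest, h_rest) / l0 - m * m

# Hypothetical data with p = 2: x1 = 2*x2 + (0.5 + |x2|) * e, e ~ N(0, 1),
# so that E(x1 | x2 = 0.5) = 1 and var(x1 | x2 = 0.5) = 1.
rng = random.Random(0)
n = 4000
data = []
for _ in range(n):
    x2 = rng.uniform(-1.0, 1.0)
    data.append((2.0 * x2 + (0.5 + abs(x2)) * rng.gauss(0.0, 1.0), x2))

h = [n ** (-1.0 / 5.0)]   # one bandwidth per conditioning variable, rate n^(-1/5)
l0_hat = ell_hat(0, data, [0.5], h)
m_hat = cond_mean(data, [0.5], h)
v_hat = cond_var(data, [0.5], h)
print(l0_hat, m_hat, v_hat)
```

The normalizing constant $(n \prod h_i)^{-1}$ cancels in both ratios, so only $\hat\ell_0$ itself depends on it. With a second-order kernel the variance estimate carries a smoothing bias of order $h^2$, which is visible in small samples.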



3. PROPERTIES OF ESTIMATORS AND CONFIDENCE INTERVALS 

Under certain regularity conditions on $f$ we show in this section
that the statistics $\hat f(w)$, $\hat M(w')$ and $\hat V(w')$ are consistent estimators for
$f(w)$, $M(w') = E(x_1 \mid x' = w')$ and $V(w') = \mathrm{var}(x_1 \mid x' = w')$ respectively.
We further obtain the variances and the estimates of the variances of
$\hat f$, $\hat M$ and $\hat V$. We also prove the asymptotic normality of these estimators.
Finally, using the estimates of the variances of $\hat f$, $\hat M$ and $\hat V$ and their distributional
properties we obtain $100(1 - \alpha)\%$ confidence intervals for $f$, $M$
and $V$. Proofs of the results are presented in Section 4.

Theorem 3.1. Let all the sth order partial derivatives of $f$ be continuous
at $w$. Then taking

$$h_i \propto n^{-1/(2s+p)} \tag{3.1}$$

we have

$$\hat f(w) = f(w) + O_p(n^{-s/(2s+p)}); \tag{3.2}$$

and with

$$a_n = n \prod_{i=1}^{p} h_i \tag{3.3}$$

we have

$$a_n \,\mathrm{var}(\hat f(w)) = A_0(w) + o(1), \tag{3.4}$$

where

$$A_0(w) = f(w) \int K^2; \qquad \int K^2 = \prod_{i=1}^{p} \Big(\int K_i^2(y)\,dy\Big) \tag{3.5}$$

and

$$(a_n)^{1/2} \big(\hat f(w) - E(\hat f(w))\big) \xrightarrow{d} N(0, A_0(w)), \tag{3.6}$$

provided $a_n \to \infty$.
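A heuristic reading of the rate in (3.1), offered only as a sketch and assuming for illustration a common bandwidth $h_1 = \cdots = h_p = h$: the squared bias of the density estimator is of order $h^{2s}$ (see (3.7)), its variance is of order $(nh^p)^{-1}$ (see (3.4)), and (3.1) is the choice that balances the two.

```latex
\mathrm{MSE}(\hat f(w)) = O(h^{2s}) + O\big((n h^{p})^{-1}\big), \qquad
h^{2s} \asymp (n h^{p})^{-1} \iff h \asymp n^{-1/(2s+p)}, \qquad
\mathrm{MSE}(\hat f(w)) = O\big(n^{-2s/(2s+p)}\big).
```

For example, with $s = 2$ and $p = 1$ this gives $h \propto n^{-1/5}$, which for $n = 100$ is about .398, the window width used in Section 5.1.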




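Remark 3.1. Since, as we will see in the proof of Theorem 3.1,

$$E[\hat f(w)] = f(w) + O(\max\{h_1^s, \ldots, h_p^s\}), \tag{3.7}$$

if we take the $h_i$'s so that

$$(a_n)^{1/2} (\max\{h_1^s, \ldots, h_p^s\}) = o(1), \tag{3.8}$$

(for example, with the $h_i$'s proportional to a suitable power of $n$ indexed by any $\epsilon > 0$), then

$$(a_n)^{1/2} (\hat f(w) - f(w)) \xrightarrow{d} N(0, A_0(w)). \tag{3.9}$$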


Theorem 3.2. Let the sth order partial derivatives of $\ell_0$ and $\ell_1$ be continuous
at $w'$. Further, let $\ell_2$ be continuous at $w'$. Then taking

$$h_i \propto n^{-1/(2s+p-1)}, \tag{3.1$'$}$$

we have

$$\hat M(w') - M(w') = O_p(n^{-s/(2s+p-1)}); \tag{3.10}$$

and with

$$a_n' = n \prod_{i=2}^{p} h_i \tag{3.3$'$}$$

we have

$$a_n' \,\mathrm{var}(\hat M(w')) = A_1(w') + o(1), \tag{3.11}$$

where

$$A_1(w') = [\mathrm{var}(x_1 \mid x' = w')/\ell_0(w')] \int (K')^2, \tag{3.12}$$

with

$$\int (K')^2 = \prod_{i=2}^{p} \Big(\int K_i^2(y)\,dy\Big);$$

and

$$(a_n')^{1/2} \{\hat M(w') - E[\hat M(w')]\} \xrightarrow{d} N(0, A_1(w')), \tag{3.13}$$

provided $a_n' \to \infty$.

Remark 3.2. From the proof of Theorem 3.2, it follows that

$$E[\hat M(w')] = M(w') + O(\max\{h_2^s, \ldots, h_p^s\}).$$

Therefore, if the $h_i$'s are chosen in such a way that

$$(a_n')^{1/2} (\max\{h_2^s, \ldots, h_p^s\}) = o(1),$$

(that is, with the $h_i$'s proportional to a suitable power of $n$ indexed by any $\epsilon > 0$), then

$$(a_n')^{1/2} (\hat M(w') - M(w')) \xrightarrow{d} N(0, A_1(w')). \tag{3.14}$$

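Theorem 3.3. Let the sth order partial derivatives of $\ell_0$, $\ell_1$ and $\ell_2$ be
continuous at $w'$. Further, let $\ell_4$ be continuous at $w'$. Then taking $h_i$ as
in (3.1)$'$, we have

$$\hat V(w') = V(w') + O_p(n^{-s/(2s+p-1)}) \tag{3.15}$$

and

$$a_n' \,\mathrm{var}(\hat V(w')) = A_2(w') + o(1), \tag{3.16}$$

where

$$A_2(w') = \frac{(\ell_4(w')/\ell_0(w')) - (\ell_2(w')/\ell_0(w'))^2}{\ell_0(w')} \int (K')^2 = [\mathrm{var}(x_1^2 \mid x' = w')/\ell_0(w')] \int (K')^2 \tag{3.17}$$

and

$$(a_n')^{1/2} \{\hat V(w') - E[\hat V(w')]\} \xrightarrow{d} N(0, A_2(w')). \tag{3.18}$$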

Remark 3.3. It will be seen in the proof of Theorem 3.3 that $E(\hat V(w')) =
V(w') + O(\max\{h_2^s, \ldots, h_p^s\})$. Thus if the $h_i$ satisfy the hypothesis of Remark
3.2, then

$$(a_n')^{1/2} (\hat V(w') - V(w')) \xrightarrow{d} N(0, A_2(w')). \tag{3.19}$$

The computable confidence intervals of $f$, $M$ and $V$, respectively, can easily
be written from (3.9), (3.14) and (3.19) after replacing $f$ and $\ell_j$, $j = 0, 1, 2$
and 4, by their consistent estimates $\hat f$ and $\hat\ell_j$.
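As an illustration of how such an interval can be computed from (3.9), the following sketch forms the interval for the density, $\hat f(w) \pm z\,[\hat f(w) \int K^2 / (n \prod h_i)]^{1/2}$. It is not taken from the paper: the function name is hypothetical, and it assumes the same kernel in every coordinate, with the Gaussian value $\int K_i^2 = 1/(2\sqrt{\pi})$ as the default.

```python
import math

# Integral of the squared standard normal density: 1 / (2 * sqrt(pi)).
GAUSS_ROUGHNESS = 1.0 / (2.0 * math.sqrt(math.pi))

def density_ci(f_hat, n, bandwidths, roughness=GAUSS_ROUGHNESS, z=1.96):
    """Asymptotic interval for f(w) based on (3.9).

    f_hat      : kernel density estimate at the point w
    n          : sample size
    bandwidths : list (h_1, ..., h_p)
    roughness  : integral of K_i^2, assumed equal for all coordinates
    """
    a_n = n * math.prod(bandwidths)                  # a_n = n * prod h_i, as in (3.3)
    a0_hat = f_hat * roughness ** len(bandwidths)    # estimate of A_0(w) in (3.5)
    half_width = z * math.sqrt(a0_hat / a_n)
    return f_hat - half_width, f_hat + half_width

# Univariate example: n = 100 and h = 100^(-1/5), with a density estimate of 0.4.
lower, upper = density_ci(0.4, 100, [100 ** -0.2])
print(lower, upper)
```

The intervals for $M$ and $V$ follow the same pattern with $a_n'$ in place of $a_n$ and with $A_1$ or $A_2$ estimated from $\hat\ell_0$, $\hat\ell_1$, $\hat\ell_2$ and $\hat\ell_4$.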

4. PROOFS OF THE THEOREMS IN SECTION 3 

In this section we give proofs of our theorems in Section 3. 

Proof of Theorem 3.1. Since $(x_{t1}, \ldots, x_{tp})$, $t = 1, \ldots, n$, are i.i.d. with
joint p.d.f. $f$, if we take the expectation of $\hat f(w)$ in (2.2) with respect
to the joint distribution of $x = (x_1, \ldots, x_p)$ and use the transformation
theorem, we obtain

$$E[\hat f(w)] = \int K(y) f(w + hy)\,dy, \tag{4.1}$$

where $(w + hy) = (w_1 + h_1 y_1, \ldots, w_p + h_p y_p)$. Now, if we replace $f(w + hy)$
by its Taylor-series expansion at $w$ with Lagrange's form of the remainder
at the sth stage, apply the properties of $K_i$, and then use the continuity of
the sth order partial derivatives of $f$ at $w$, we obtain by the arguments in
the proof of Theorem 1 of Singh (1981) that

$$E[\hat f(w)] = f(w) + O(\max\{h_1^s, \ldots, h_p^s\}). \tag{4.2}$$

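Further, since $(x_{t1}, \ldots, x_{tp})$ for $t = 1, 2, \ldots, n$ are i.i.d. with joint p.d.f. $f$,

$$\begin{aligned}
\mathrm{var}(\hat f(w)) &= n^{-1}\,\mathrm{var}\Big(\Big(\prod_{i=1}^{p} h_i\Big)^{-1} K\Big(\frac{x_t - w}{h}\Big)\Big) \\
&= \Big(n \prod_{i=1}^{p} h_i\Big)^{-1} \int K^2(y) f(w + hy)\,dy - n^{-1}(E[\hat f(w)])^2 \\
&= \Big(n \prod_{i=1}^{p} h_i\Big)^{-1} \Big[f(w) \int K^2 + o(1)\Big],
\end{aligned} \tag{4.3}$$

where $\int K^2 = \prod_{i=1}^{p} (\int K_i^2(y)\,dy)$. The last equation follows by arguments
used to prove Theorem 2.2 of Singh and Ullah (1984). Now (3.4) follows
from (4.3).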

Now (4.2) and (4.3) followed by (3.1) prove that

$$\mathrm{MSE}(\hat f(w)) = O(n^{-2s/(2s+p)}), \tag{4.4}$$

which, with an application of Chebyshev's inequality, proves (3.2).

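To prove (3.6), let

$$L_{nt} = \Big(n \prod_{i=1}^{p} h_i\Big)^{-1} \Big\{K\Big(\frac{x_t - w}{h}\Big) - E\Big[K\Big(\frac{x_t - w}{h}\Big)\Big]\Big\} \Big/ \big(\mathrm{var}(\hat f(w))\big)^{1/2},$$

where $K((x_t - w)/h) = \prod_{i=1}^{p} K_i((x_{ti} - w_i)/h_i)$.
Then $L_{n1}, \ldots, L_{nn}$ are i.i.d. centered random variables with

$$S_n = \sum_{t=1}^{n} L_{nt} = \frac{\hat f(w) - E(\hat f(w))}{(\mathrm{var}(\hat f(w)))^{1/2}}$$

and $\mathrm{var}(S_n) = 1$. Temporarily let $\Phi(\cdot)$ denote the distribution function of
the standard normal random variable. Then by the Berry-Esseen Theorem
(see Chung, 1974, Theorem 7.4.1),

$$\sup_{t} |P(S_n \le t) - \Phi(t)| \le c \sum_{t=1}^{n} E|L_{nt}|^3, \tag{4.5}$$

where $c$ is an absolute constant. But by the inequality given by Loeve (1963,
p. 155),

$$E|L_{nt}|^3 \le 4\,[\mathrm{var}(\hat f(w))]^{-3/2} \Big(n \prod_{i=1}^{p} h_i\Big)^{-3} E\Big|K\Big(\frac{x_t - w}{h}\Big)\Big|^3.$$

Since

$$\Big(\prod_{i=1}^{p} h_i\Big)^{-1} E\Big|K\Big(\frac{x_t - w}{h}\Big)\Big|^3 = \int |K|^3 f(w + hy)\,dy = f(w) \int |K|^3 + o(1),$$

and $a_n \,\mathrm{var}(\hat f(w)) = A_0(w) + o(1)$, we see that

$$\sum_{t=1}^{n} E|L_{nt}|^3 = O(a_n^{-1/2}).$$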







R. S. SINGH, A. ULLAH AND R. A. L. CARTER 



Thus, we conclude that

$$\sup_{t \in R} \Big| P\Big(\frac{\hat f(w) - E(\hat f(w))}{(\mathrm{var}(\hat f(w)))^{1/2}} \le t\Big) - \Phi(t) \Big| = O(a_n^{-1/2}).$$

This result and (3.4) give (3.6).



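Proof of Theorem 3.2. Throughout this proof $\hat\ell_0$, $\hat\ell_1$, $\ell_0$, $\ell_1$ and $\ell_2$
are evaluated at $w' = (w_2, \ldots, w_p) \in R^{p-1}$ and therefore the argument $w'$
in these functions will not be displayed. From the proof of Theorem 3.1, and
the hypothesis on $\ell_0$, it follows that

$$E(\hat\ell_0) = \ell_0 + O(\max\{h_2^s, \ldots, h_p^s\}),$$

and

$$\mathrm{var}(\hat\ell_0) = (a_n')^{-1}\Big[\ell_0 \int (K')^2 + o(1)\Big]. \tag{4.6}$$

Thus with the choice of $h_i$'s in (3.1)$'$, $\mathrm{MSE}(\hat\ell_0) = O(n^{-2s/(2s+p-1)})$, and
hence

$$\hat\ell_0 = \ell_0 + O_p(n^{-s/(2s+p-1)}). \tag{4.7}$$

Similarly, in view of the hypothesis on $\ell_1$ and $\ell_2$, it follows that

$$E(\hat\ell_1) = \ell_1 + O(\max\{h_2^s, \ldots, h_p^s\});$$

and

$$\mathrm{var}(\hat\ell_1) = (a_n')^{-1}\Big[\ell_2 \int (K')^2 + o(1)\Big] \tag{4.8}$$

and, hence with (3.1)$'$,

$$\hat\ell_1 = \ell_1 + O_p(n^{-s/(2s+p-1)}). \tag{4.9}$$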



Now we evaluate $\mathrm{cov}(\hat\ell_0, \hat\ell_1)$. Recall that $x_t' = (x_{t2}, \ldots, x_{tp})$ and
$K'(y_2, \ldots, y_p) = \prod_{i=2}^{p} K_i(y_i)$. Since the summands in $\hat\ell_0$ are i.i.d., as are
the summands in $\hat\ell_1$,
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where 



But the r.h.s. of (4.10) is 

j yi{{K'{y)Yfiyi,W2 + h2y2,...,Wf, + hpyp)dy- 



M=2 



= e^j{K'y + o(J[h^ ; 

this follows by the arguments used to prove the first part of Theorem 3.1. 
Thus, we conclude that 

$$a_n' \,\mathrm{cov}(\hat\ell_0, \hat\ell_1) = \ell_1 \int (K')^2 + o(1). \tag{4.11}$$

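Now, if we write

$$\hat M = \frac{\hat\ell_1}{\hat\ell_0} = \frac{E(\hat\ell_1)}{E(\hat\ell_0)} \Big[1 + \frac{\hat\ell_1 - E(\hat\ell_1)}{E(\hat\ell_1)} - \frac{\hat\ell_0 - E(\hat\ell_0)}{E(\hat\ell_0)}\Big] + O_p\big((a_n')^{-1}\big) \tag{4.12}$$

and apply (4.6)-(4.9), we get (3.10). Now (4.12) followed by (4.6)-(4.9) and
(4.11) gives

$$a_n' \,\mathrm{var}(\hat M) = \frac{\ell_1^2}{\ell_0^2} \Big[\frac{\ell_2 \int (K')^2}{\ell_1^2} - \frac{\int (K')^2}{\ell_0}\Big] + o(1) = A_1 + o(1)$$

from the definition of $A_1$ in (3.12). Thus the proof of (3.11) is complete.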
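The algebra behind this variance limit can be spelled out as follows. This is a sketch that takes as given the limits $a_n' \mathrm{var}(\hat\ell_0) \to \ell_0 c$, $a_n' \mathrm{var}(\hat\ell_1) \to \ell_2 c$ and $a_n' \mathrm{cov}(\hat\ell_0, \hat\ell_1) \to \ell_1 c$, where $c = \int (K')^2$ is a shorthand introduced here for brevity, and uses $\mathrm{var}(x_1 \mid x' = w') = \ell_2/\ell_0 - (\ell_1/\ell_0)^2$.

```latex
a_n' \,\mathrm{var}(\hat M)
  \approx \Big(\frac{\ell_1}{\ell_0}\Big)^{2}
    \Big[\frac{a_n' \,\mathrm{var}(\hat\ell_1)}{\ell_1^{2}}
       + \frac{a_n' \,\mathrm{var}(\hat\ell_0)}{\ell_0^{2}}
       - \frac{2 a_n' \,\mathrm{cov}(\hat\ell_0, \hat\ell_1)}{\ell_0 \ell_1}\Big]
  \to \Big(\frac{\ell_1}{\ell_0}\Big)^{2}
    \Big[\frac{\ell_2 c}{\ell_1^{2}} + \frac{c}{\ell_0} - \frac{2c}{\ell_0}\Big]
  = \frac{c}{\ell_0}\Big[\frac{\ell_2}{\ell_0} - \Big(\frac{\ell_1}{\ell_0}\Big)^{2}\Big]
  = A_1 .
```

The first step is the usual first-order (delta method) approximation to the variance of a ratio, applied to the expansion (4.12).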

Now we prove (3.13). From the arguments used to prove the asymptotic
normality of $\hat f$, it follows that

$$(a_n')^{1/2} (\hat\ell_0 - E(\hat\ell_0)) \xrightarrow{d} N\Big(0, \ell_0 \int (K')^2\Big) \tag{4.13}$$

and

$$(a_n')^{1/2} (\hat\ell_1 - E(\hat\ell_1)) \xrightarrow{d} N\Big(0, \ell_2 \int (K')^2\Big). \tag{4.14}$$

Now (4.12) followed by (4.11), (4.13) and (4.14) gives (3.13).

Proof of Theorem 3.3. As in the proof of Theorem 3.2, throughout the
proof of this theorem, $\ell_j$, $V$, $M$, $\hat\ell_j$, $\hat V$ and $\hat M$ are evaluated at $w'$, the point
considered in Theorem 3.2.

By arguments identical to those used in the proof of Theorem 3.2, it
follows that

$$E(\hat\ell_2) = \ell_2 + O(\max\{h_2^s, \ldots, h_p^s\}),$$

and

$$\hat\ell_2 = \ell_2 + O_p(n^{-s/(2s+p-1)}) \tag{4.15}$$

with the $h_i$'s taken as in (3.1)$'$. Further, arguments applied to prove
(4.11) can be used to show that

$$a_n' \,\mathrm{cov}(\hat\ell_0, \hat\ell_2) = \ell_2 \int (K')^2 + o(1). \tag{4.16}$$



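Hence, if we write

$$\frac{\hat\ell_2}{\hat\ell_0} = \frac{E(\hat\ell_2)}{E(\hat\ell_0)} \Big[1 + \frac{\hat\ell_2 - E(\hat\ell_2)}{E(\hat\ell_2)} - \frac{\hat\ell_0 - E(\hat\ell_0)}{E(\hat\ell_0)}\Big] + O_p\big((a_n')^{-1}\big) \tag{4.17}$$

and use (4.6)-(4.9), (4.15) and (4.16), it follows that

$$E(\hat\ell_2/\hat\ell_0) = (\ell_2/\ell_0) + O(\max\{h_2^s, \ldots, h_p^s\})$$

and

$$a_n' \,\mathrm{var}(\hat\ell_2/\hat\ell_0) = \frac{[(\ell_4/\ell_0) - (\ell_2/\ell_0)^2] \int (K')^2}{\ell_0} + o(1) = A_2 + o(1). \tag{4.18}$$

Hence, by Chebyshev's inequality,

$$\frac{\hat\ell_2}{\hat\ell_0} = \frac{\ell_2}{\ell_0} + O_p\big((a_n')^{-1/2}\big). \tag{4.19}$$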

Moreover, by the arguments used to prove the asymptotic normality of $\hat f$, it
follows that

$$(a_n')^{1/2} (\hat\ell_2 - E(\hat\ell_2)) \xrightarrow{d} N\Big(0, \ell_4 \int (K')^2\Big). \tag{4.20}$$

Thus (4.17) followed by (4.13), (4.16) and (4.20) proves

$$(a_n')^{1/2} \big((\hat\ell_2/\hat\ell_0) - E(\hat\ell_2/\hat\ell_0)\big) \xrightarrow{d} N(0, A_2). \tag{4.21}$$



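Now we obtain the results of Theorem 3.3 with regard to $\hat V$. From (4.12),
(4.6)-(4.9) and (4.11) it follows that

$$\begin{aligned}
E(\hat M^2) &= \Big(\frac{E(\hat\ell_1)}{E(\hat\ell_0)}\Big)^{2} \Big\{1 + \frac{\mathrm{var}(\hat\ell_1)}{(E(\hat\ell_1))^2} + \frac{\mathrm{var}(\hat\ell_0)}{(E(\hat\ell_0))^2} - \frac{2\,\mathrm{cov}(\hat\ell_1, \hat\ell_0)}{E(\hat\ell_1)E(\hat\ell_0)}\Big\} + o\big((a_n')^{-1}\big) \\
&= \Big(\frac{E(\hat\ell_1)}{E(\hat\ell_0)}\Big)^{2} \big[1 + O\big((a_n')^{-1}\big)\big].
\end{aligned} \tag{4.22}$$

Hence from (4.17) it follows that

$$E(\hat V) = (\ell_2/\ell_0) - (\ell_1/\ell_0)^2 + O(\max\{h_2^s, \ldots, h_p^s\}) = V + O(\max\{h_2^s, \ldots, h_p^s\}), \tag{4.23}$$

and from (4.19) and (3.10) it follows that

$$\hat V = V + O_p\big((a_n')^{-1/2}\big), \tag{4.24}$$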

which proves (3.15). Now, from (4.12) and (4.6)-(4.9) it follows that 

V = {h/io) -M^ = [h/io) - + Op (a'„)-' . 

This result followed by (4.18) proves (3.16), and followed by (4.21) proves 
(3.18). 



5. APPLICATIONS 

In this section we consider the econometric applications of the estimation 
of densities and variances. 

5.1 Estimating the Density Functions of Exact Sampling Distributions of 
Econometric Estimators 

An important application of the kernel estimator is in estimating the 
density functions of the exact sampling distributions of econometric estima- 
tors and test statistics which are nonlinear functions of the endogenous data. 
Such estimated density functions are useful directly, such as for estimating 
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the true size of asymptotic tests, and indirectly as input to the extended 
rational approximants (ERA’s) of Phillips (1983). The aim of this section is 
to illustrate the technique with an example which is simple but which also 
has wide applicability and allows the production of new results. 

Assume a data generating process (DGP) or joint p.d.f. of the form:

$$y_{1t} = \beta x_{1t} + \gamma y_{2t} + u_{1t}, \tag{5.1}$$

$$y_{2t} = \pi_{12} x_{1t} + \pi_{22} x_{2t} + v_{2t}; \qquad t = 1, \ldots, T. \tag{5.2}$$

Equation (5.1) is a structural equation containing the parameters of interest,
$\beta$ and $\gamma$, while (5.2) is a reduced form equation showing how the endogenous
variable $y_2$ is generated as a linear combination of the two exogenous variables
$x_1$ and $x_2$ plus an independent normal error $v_2$ which has zero mean
and variance $\omega_{22}$. The error $u_1$ is also assumed to be independent, normal
with mean zero and variance $\sigma_{11}$. There is also a reduced form equation for
$y_1$:

$$y_{1t} = \pi_{11} x_{1t} + \pi_{21} x_{2t} + v_{1t}, \tag{5.3}$$

where $v_1$ is independent $N(0, \omega_{11})$.

Because equation (5.1) is just-identified we can write the parameters of
interest as functions of the reduced form parameters:

$$\gamma = \frac{\pi_{21}}{\pi_{22}} \qquad \text{and} \qquad \beta = \pi_{11} - \frac{\pi_{12}\pi_{21}}{\pi_{22}}. \tag{5.4}$$



The normality of the errors together with the exogeneity of the x's
implies that least squares (LS) applied to (5.2) and (5.3) will produce
maximum-likelihood (ML) estimators of the reduced form parameters from which ML
estimators of $\beta$ and $\gamma$ can be obtained using (5.4). These estimators of $\beta$
and $\gamma$ are also indirect-least-squares (ILS) or instrumental-variable (IV) estimators,
where $x_1$ and $x_2$ are the instruments. They are consistent with
asymptotically normal sampling distributions. Their exact sampling distributions
were given by Basmann et al. (1971), who pointed out that they
possess no positive integral-order moments.

Two test statistics of natural interest are:

$$t_\beta = \frac{\hat\beta - \beta}{\mathrm{as}(\hat\beta)} \qquad \text{and} \qquad t_\gamma = \frac{\hat\gamma - \gamma}{\mathrm{as}(\hat\gamma)}, \tag{5.5}$$

where as($\hat\beta$) and as($\hat\gamma$) are the (estimated) asymptotic standard errors of
$\hat\beta$ and $\hat\gamma$. Asymptotically $t_\beta$ and $t_\gamma$ follow the standard normal distribution
but their exact distribution appears to be unknown. (Richardson and
Rohr (1971) derive the exact distribution for similar test statistics in the
over-identified case.) Although the random denominators in these ratios
are always positive, the fact that the numerators lack moments makes one
reluctant to assume that the ratios have moments.

Monte Carlo simulation can be used to produce samples of values of
$\hat\beta$, $\hat\gamma$, $t_\beta$ and $t_\gamma$. However, measures of bias, mean-squared-error and other
moments computed from such samples are of little relevance, given the non-existence
of the population parameters. Nonparametric methods, by contrast,
are well suited to estimating the density functions of $\hat\beta$, $\hat\gamma$, $t_\beta$ and $t_\gamma$. Indeed,
such density function estimates may well be regarded as more complete and
useful than moment estimates even if the population moments did exist.

For the purposes of the Monte Carlo experiment the parameters of the
DGP were set to the following values: $\beta = .6$, $\gamma = .2$, $\pi_{11} = .545455$, $\pi_{21} =
.163636$, $\pi_{12} = -.272727$, $\pi_{22} = .818182$, $T = 20$. Values of $x_{1t}$ and $x_{2t}$ were
generated such that:

$$\sum_t x_{1t} x_{2t} = 0.0 \qquad \text{and} \qquad \sum_t x_{1t}^2 = \sum_t x_{2t}^2 = 20.0.$$

The set of x's was fixed over repeated samples. Two alternative reduced
form error covariances were employed. Under the heading "loose fit",
$\omega_{11} = .535537$ and $\omega_{22} = 53.5537$. When combined with the x's, these
values give standard errors such that: $s(\hat\pi_{21}) = \pi_{21}$ and $s(\hat\pi_{22}) = 2\pi_{22}$,
i.e., the probability of obtaining a negative value of $\hat\pi_{22}$ is .309. Under the
heading "tight fit", $\omega_{11} = .0360331$ and $\omega_{22} = .0826446$. These values gave
population goodness-of-fit measures of .90 for both reduced form equations.
While this may be a realistic specification, it implies that the probability 
of obtaining a negative it 22 is less than 1.7765 x 10“^^. In both cases the 
covariance between $v_{1t}$ and $v_{2t}$, $\omega_{12}$, was set to 0.0. Two experiments, each
of 100 replications, were run: one loose fit and one tight fit.
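The design just described can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' program: the paper does not report the x values, so the sketch uses hypothetical orthogonal sequences of plus and minus ones that satisfy the stated sums, Python's built-in generator stands in for the Super Duper generator, and all function names are invented for the example.

```python
import random
import statistics

BETA, GAMMA = 0.6, 0.2
PI11, PI21, PI12, PI22 = 0.545455, 0.163636, -0.272727, 0.818182
# Assumed design (actual x values are not reported): orthogonal +/-1 sequences
# of length T = 20, so sum x1*x2 = 0 and sum x1^2 = sum x2^2 = 20.
X1 = [1.0, -1.0] * 10
X2 = [1.0, 1.0, -1.0, -1.0] * 5
SXX = 20.0

def ls_coef(x, y):
    """LS coefficient on one regressor; exact here because X1 and X2 are orthogonal."""
    return sum(a * b for a, b in zip(x, y)) / SXX

def ils_draw(rng, omega11, omega22):
    """One replication: simulate the two reduced forms, fit by LS, map to (beta, gamma)."""
    y1 = [PI11 * a + PI21 * b + rng.gauss(0.0, omega11 ** 0.5) for a, b in zip(X1, X2)]
    y2 = [PI12 * a + PI22 * b + rng.gauss(0.0, omega22 ** 0.5) for a, b in zip(X1, X2)]
    gamma_hat = ls_coef(X2, y1) / ls_coef(X2, y2)
    beta_hat = ls_coef(X1, y1) - ls_coef(X1, y2) * gamma_hat
    return beta_hat, gamma_hat

def experiment(omega11, omega22, reps=100, seed=1):
    """Run `reps` replications with independent reduced form errors."""
    rng = random.Random(seed)
    return [ils_draw(rng, omega11, omega22) for _ in range(reps)]

tight = experiment(0.0360331, 0.0826446)
loose = experiment(0.535537, 53.5537)
for name, draws in (("tight", tight), ("loose", loose)):
    betas = [b for b, _ in draws]
    gammas = [g for _, g in draws]
    print(name, statistics.median(betas), statistics.median(gammas))
```

Medians are printed rather than means because the ILS estimators possess no moments, so sample averages from such an experiment are not informative.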

The random number generator used was a version of Marsaglia’s Super 
Duper generator as implemented by McLeod (1982). 

Each experiment resulted in a frequency distribution for $\hat\beta$, $\hat\gamma$, $t_\beta$ and $t_\gamma$.
In addition, frequency distributions were formed for $\hat\pi_{21}$ and $t_\pi =
(\hat\pi_{21} - \pi_{21})/s(\hat\pi_{21})$, where $s(\hat\pi_{21})$ is the (estimated) standard error of $\hat\pi_{21}$.
While these frequency distributions convey some information about the un- 
derlying sampling distributions, the modest number of replications used 
means they are lumpy, with several empty classes. 

Nonparametric estimates of the densities of $\hat\beta$, $\hat\gamma$, $t_\beta$ and $t_\gamma$ were obtained
from

$$\hat f(z) = (nh)^{-1} \sum_{i=1}^{n} K\Big(\frac{z_i - z}{h}\Big), \tag{5.6}$$

where:

(i) $z_i$ is a value of $\hat\beta$, $\hat\gamma$, $t_\beta$ or $t_\gamma$ obtained by Monte Carlo simulation, standardized
by subtracting its average over the 100 replications and dividing
by its standard deviation over the 100 replications. This standardizing
transformation alters the location and scale of the density but not its
shape.

(ii) $z$ is a value at which the density is to be estimated. Values of $z$ were set
between -5 and 5 in increments of .1.

(iii) $h$ is the window width, $100^{-1/5} = .398$.

(iv) The normal kernel is used: $K(y) = (2\pi)^{-1/2}\exp(-y^2/2)$.
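Steps (i) to (iv) can be sketched as follows. The code is an illustration only: the function names are hypothetical, the 100 pseudo-random normal draws merely stand in for simulated values of a statistic, and the divisor $n - 1$ in the standard deviation is an assumption, since the text does not say which divisor was used.

```python
import math
import random

def standardize(values):
    """Step (i): subtract the average and divide by the standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))  # n - 1 is assumed
    return [(v - mean) / sd for v in values]

def kde(z_std, grid, h):
    """Density estimate (n h)^(-1) * sum_i K((z_i - z) / h) with the normal kernel."""
    norm = 1.0 / (len(z_std) * h * math.sqrt(2.0 * math.pi))
    return [norm * sum(math.exp(-0.5 * ((zi - z) / h) ** 2) for zi in z_std)
            for z in grid]

rng = random.Random(42)
draws = [rng.gauss(0.0, 1.0) for _ in range(100)]   # placeholder for simulated statistics
z_std = standardize(draws)
grid = [-5.0 + 0.1 * k for k in range(101)]          # step (ii): z from -5 to 5 by .1
h = len(z_std) ** (-1.0 / 5.0)                       # step (iii): 100^(-1/5)
density = kde(z_std, grid, h)
print(max(density))
```

Because each kernel integrates to one, the estimated density integrates to one as well, which gives a quick check on any implementation.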

Of course, the estimates obtained by this procedure embody some sampling
error. Therefore, when (5.6) was evaluated with $z_i$ formed from $\hat\pi_{21}$, the
resulting density function was not exactly standard normal: compare the
standard normal density in Figure 1 to the estimated density for $\hat\pi_{21}$ from
the tight fit experiment in Figure 2. (The estimated density of $\hat\pi_{21}$ from the
loose fit experiment and the estimated densities of $t_\pi$ for both experiments
were nearly identical to Figure 2.) Although the density function in Figure
2 has its peak slightly too far right and is slightly skewed left, it is still
a very good estimate of the standard normal density function in Figure 1,
even though it is based on only 100 points. This gives us confidence that the
estimated densities for $\hat\beta$, $\hat\gamma$, $t_\beta$ and $t_\gamma$ will also be close to their population
counterparts.

The estimated density for $\hat\beta$ from the loose fit experiment is plotted in
Figure 3. The analogous plot for $\hat\gamma$ is extremely close to that shown in Figure
3. Both densities have very high peaks and long, thin tails. The estimated
density of $t_\beta$ from the loose fit experiment, see Figure 5, looks very similar
to Figure 2, but the estimated density of $t_\gamma$ is strongly skewed to the left;
see Figure 6.

The estimated density of $\hat\beta$ from the tight fit experiment is plotted in
Figure 4. (The plot of $\hat\gamma$ was very similar to that for $\hat\beta$.) It contrasts sharply
with the earlier results; the high peaks and long tails are absent. Indeed
Figure 4 looks very much like Figure 2. Figures 7 and 8 show the estimated
densities for $t_\beta$ and $t_\gamma$ when the fit was tight. Now the $t_\gamma$ distribution closely
resembles the $t_\beta$ distribution, in contrast to the skewed distribution obtained
when the fit was loose.

The estimated density functions presented in Figures 3 to 8 suffer from
the disadvantage that they are point estimates. One might reasonably ask
for measures of their precision or, better still, interval estimates. Asymptotic
95% confidence intervals for $f(z)$ are given by

$$\hat f(z) \pm 1.96 \Big[\frac{\hat f(z)}{2nh\sqrt{\pi}}\Big]^{1/2},$$

with $n = 100$, the number of replications used in the simulation. These
confidence intervals are plotted for $t_\beta$ and $t_\gamma$ for the loose fit experiment in
Figures 9 and 10. Both sets of confidence intervals from the tight fit experiment
closely resembled Figure 9. The standard normal density function
(Figure 1) lies entirely within those confidence limits for $t_\beta$ from both experiments
and for $t_\gamma$ from the tight fit experiment. However, it lies outside
these limits for $t_\gamma$ from the loose fit experiment.

Figure 5. Nonparametric estimate of the density of $t_\beta$, loose fit experiment.

Figure 6. Nonparametric estimate of the density of $t_\gamma$, loose fit experiment.

Figure 7. Nonparametric estimate of the density of $t_\beta$, tight fit experiment.

Figure 8. Nonparametric estimate of the density of $t_\gamma$, tight fit experiment.

The nonparametric density estimates presented in this section suggest 
several conclusions. First, the shape of the exact, small sample distributions 
of ILS/IV estimators of the structural parameters of just-identified models 
depends crucially upon the probability that reduced form coefficient esti- 
mates, which appear in the denominators of ratios entering the expression 
for structural coefficient estimates, change in sign. This probability will be 
high if the goodness-of-fit of the reduced form is low and/or when small 
samples are employed. When this probability is high the small-sample distributions
of the structural coefficient estimators have high peaks and long
thin tails, i.e., they are much different from their large-sample asymptotic 
distributions. The difference between the small and large sample distribution 
decreases as the probability of sign change decreases. 

The second conclusion, which is of much greater operational significance, 
is that the small-sample distribution of t ratios depends not only on the prob- 
ability of reduced form coefficient estimates changing sign, but also upon 
which structural coefficients enter the t ratio. Those formed from the co- 
efficients of exogenous variables appear to have small-sample distributions 
which always resemble their large sample distributions. However, t ratios 
formed from the coefficients of endogenous variables have small-sample dis- 
tributions resembling the standard normal only if the probability of sign 
change noted above is small, e.g., if the reduced form fits tightly. In other 
cases their shape is distinctly non-normal so that the use of the standard 
normal may yield poor inferences. 



5.2 Estimation of Unknown Variances (Heteroskedasticity) 

Here we analyze the conditional variance of earnings ($y$) with respect to
experience ($z$). For simplicity in illustration, we have assumed schooling
to be constant. Our main interest is to look into the specification of the
variability in earnings. For this purpose we considered Canadian data (1971
Canadian Census Public Use Tapes) on 205 individuals' ages (for experience)
and their earnings. These individuals were educated to Grade 13. The
conditional variance, $V(y \mid z)$, in (2.4) was estimated by using the kernel
function:




where $\sum_t (z_t - \bar z)^2 / n$ is the sample variance of $z$.

It is clear from the estimate of conditional variances in Figure 11 that 
the true form of the variability in $y$ with respect to $z$, $V(y \mid z)$, is a second
degree polynomial convex to the z axis. This is consistent with the result of 
Mincer (1974, p. 101). The important point to note, however, is that the 
variability of earnings here has been examined without using the grouped 
data, unlike Mincer (1974). 




Figure 11. Nonparametric estimate of the variance of earnings conditional
on age as a function of age.

In view of the above finding, we may conclude that $\hat V(y \mid z)$ is negatively
related with the nonparametric estimate of $E(y \mid z)$ which is, as indicated
by Ullah (1985), a second degree polynomial concave to the $z$ axis. To see
that this is actually the case we estimated the regression of $y$ on $z$, $z^2$ and
$\hat V(y \mid z)$. The results were as follows:

$$\hat y = \underset{(.987)}{11.649} + \underset{(.039)}{.115}\,z - \underset{(.005)}{.001}\,z^2 - \underset{(.602)}{1.103}\,\hat V(y \mid z),$$

where the numbers in parentheses are standard errors. Note that the coefficient
of $\hat V(y \mid z)$ is negative and significant (its t ratio is about -1.83), indicating the negative relationship
between $E(y \mid z)$ and $V(y \mid z)$. The above result provides a possible
alternative specification of the earnings equation with variability as an additional
variable.

The nonparametric estimates of $V(y \mid z)$ can also be utilized to perform
the generalized least squares (GLS) estimation technique in the earnings
equation

$$y = \alpha + \beta z + \gamma z^2 + u = X\delta + u,$$

where $X = [1 \;\; z \;\; z^2]$ and $\delta = [\alpha \;\; \beta \;\; \gamma]'$. The GLS estimator is

$$\hat\delta = (X'\hat\Sigma^{-1}X)^{-1}X'\hat\Sigma^{-1}y,$$

where $\hat\Sigma = \mathrm{diag}[\hat V(u \mid z_1), \ldots, \hat V(u \mid z_n)]$. The least squares (LS) and the
GLS estimates obtained are:

$$\text{LS}: \quad \hat y = \underset{(.518)}{10.041} + \underset{(.027)}{.173}\,z - \underset{(.0003)}{.002}\,z^2$$

and

$$\text{GLS}: \quad \hat y = \underset{(.498)}{10.274} + \underset{(.025)}{.165}\,z - \underset{(.0003)}{.002}\,z^2.$$

The GLS estimates have slightly smaller standard errors than the LS estimates
for the intercept and for the coefficient of $z$. The important
point to note here is that the GLS estimates have been obtained without
using any assumption about the form of heteroskedasticity. Carroll (1982)
has also applied nonparametric techniques to this type of model, although
his kernel and bandwidth are different from what we have used.
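The two-step procedure, a kernel estimate of the conditional variance at each sample point followed by weighted least squares, can be sketched as follows. The sketch is illustrative only: the data are simulated rather than taken from the census tapes, the Gaussian kernel and the bandwidth of 0.5 are arbitrary assumptions (the paper's kernel for this application is not reproduced here), and all function names are hypothetical.

```python
import math
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def weighted_ls(z, y, var):
    """(X' S^-1 X)^-1 X' S^-1 y with X = [1 z z^2] and S = diag(var)."""
    rows = [(1.0, zi, zi * zi) for zi in z]
    xtx = [[sum(r[i] * r[j] / v for r, v in zip(rows, var)) for j in range(3)]
           for i in range(3)]
    xty = [sum(r[i] * yi / v for r, yi, v in zip(rows, y, var)) for i in range(3)]
    return solve(xtx, xty)

def kernel_cond_var(z, y, h):
    """Kernel estimate of var(y | z) at each sample point, in the form of (2.4)."""
    out = []
    for z0 in z:
        w = [math.exp(-0.5 * ((zi - z0) / h) ** 2) for zi in z]
        l0 = sum(w)
        m1 = sum(wi * yi for wi, yi in zip(w, y)) / l0
        m2 = sum(wi * yi * yi for wi, yi in zip(w, y)) / l0
        out.append(max(m2 - m1 * m1, 1e-12))   # floor guards against rounding to <= 0
    return out

# Hypothetical heteroskedastic data with 205 observations.
rng = random.Random(7)
z = [rng.uniform(0.0, 4.0) for _ in range(205)]
y = [10.0 + 0.17 * zi - 0.02 * zi * zi + (0.2 + 0.2 * zi) * rng.gauss(0.0, 1.0)
     for zi in z]
ls = weighted_ls(z, y, [1.0] * len(z))                  # equal weights give LS
gls = weighted_ls(z, y, kernel_cond_var(z, y, h=0.5))   # weights 1 / V-hat
print(ls)
print(gls)
```

Since $\hat\Sigma$ is diagonal, GLS reduces to least squares with each observation weighted by the reciprocal of its estimated conditional variance, so no parametric form for the heteroskedasticity is needed.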
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CONFIDENCE INTERVALS FOR 
RIDGE REGRESSION PARAMETERS 

ABSTRACT 

This paper reviews various alternatives for constructing confidence in- 
tervals for ridge regression (RR) parameters, and illustrates them with an 
example. Among the newer alternatives are bootstrapping and those based 
on Stein’s (1981) unbiased estimate of the mean squared error (MSE) of a 
biased estimator of multivariate normal mean. A simulation study supports 
the validity of the confidence statements based on Stein’s model as modified
here for the ridge regression problem. It yields confidence intervals which 
can be more useful and reliable than those based on other methods. 



1. INTRODUCTION 

Although ridge regression has been used in econometrics by several authors,
there is an impression among some econometricians that reliable confidence
intervals are unavailable. The available alternatives offer a trade off
between computational expense and power. This paper provides a compre- 
hensive discussion, currently unavailable in the literature, of these alterna- 
tives. A new alternative based on the methods of Stein (1981) is discussed. 

The simplest alternative, advocated by Obenchain (1977), is to use
confidence intervals based on ordinary least squares (OLS), even if one uses
the point estimates from ridge regression. Unfortunately, the OLS interval is
sometimes meaningless on a priori grounds. For example, Vinod and Ullah
(1981, p. 12) gave an illustration of a consumption function where the OLS
estimate of the marginal propensity to consume out of wage income is -0.17,
thus having the wrong sign; a corresponding 95% confidence interval is
(-0.26, -0.08), each point of which is invalid for substantive reasons. Such
an OLS interval is centered at the wrong point, the OLS value of -0.17, and is
economically meaningless. When such problems arise the best strategy is to
look for better specifications. However, there are cases where, within the
range of specifications consistent with economic theory, none yields
reasonable OLS estimates. This is where RR is an attractive alternative to
OLS, and where much of the OLS confidence interval theory may be economically
meaningless. For examples where the OLS interval is meaningful, the
practitioner need not consider other alternatives.

A second alternative is the approximate Bayes method, discussed near
the end of Section 3. The interval is centered on the RR coefficients, rather
than those for OLS, and the standard errors are based on the diagonals of the
inverse of the $(X'X + kI)$ matrix, in the usual notation which is defined
below. Both Bayesians and frequentists can find philosophical or other reasons
for rejecting this interval. In my opinion, the approximate Bayes method
offers a quick and reasonable alternative as a first approximation, provided
the practitioner is willing to assume that RR is appropriate for the
estimation problem and to overlook philosophical questions.

A purpose of this paper is to propose a third alternative, namely a
frequentist confidence interval based on Stein’s (1981) use of the unbiased
estimate of the MSE (UMSE) of biased estimators. If $x$ is a normal random
variable with mean $\xi$, a biased estimator of $\xi$ is $\delta x$, where
$\delta$ is a shrinkage factor. Stein’s confidence interval based on a
property of normal variables is discussed in Section 2, along with some
modifications. In the remaining part of Section 1 we will develop notation,
similar to Vinod and Ullah (1981), such that the ridge estimator can be
thought of as a shrinkage factor $\delta$ times a normally distributed
variable, denoted by $C$ in (1.7) below. Section 3 explicitly applies the
methods in Section 2 to the construction of confidence intervals for RR; these
intervals are illustrated by Hald’s (1952) cement data in Section 4. Section 5
discusses a simulation study based on the data structure in Hald’s cement data
to assess the validity of confidence statements. An appendix discusses the
case of stochastic $k$.

A fourth alternative is to use bootstrap resampling (Efron, 1982) to con- 
struct a sampling distribution. Efron’s “bias corrected percentile method” 
can then be used to construct confidence intervals. Section 6 explains the 
mechanics of using the bootstrap for RR, illustrates the method with the ce- 
ment data, and also discusses the “qualms” that Schenker (1985) associated 
with using the bootstrap intervals. We note that bootstrap intervals may 
be useful for understanding the sampling distribution, provided we already 
know that RR is a good estimate of the unknown parameter. 

Let us consider the general linear regression model in the standardized
form:

$$y = X\beta + u, \quad E(u) = 0, \quad E(uu') = \sigma^2 I_n, \tag{1.1}$$

where $y$ is an $n \times 1$ vector of observations on the dependent variable,
$X$ is an $n \times p$ matrix of regressors standardized in such a way that
$X'X$ is a nonsingular correlation matrix, and $u$ is an $n \times 1$ vector of
uncorrelated errors with mean zero and common unknown variance $\sigma^2$. The
class of “ordinary” RR estimators of the $p \times 1$ vector of the parameters
$\beta$ in (1.1) is given by

$$b^k = (X'X + kI)^{-1}X'y, \tag{1.2}$$

where $k \ge 0$; see Hoerl and Kennard (1970a,b). When $k = 0$ we obtain the
ordinary least squares estimator $b^0$.

The usual unbiased estimate of $\sigma^2$ is

$$s^2 = (y - Xb^0)'(y - Xb^0)/(n - p - 1). \tag{1.3}$$

For the following discussion, it is convenient to consider a canonical form
of (1.1). We use the singular value decomposition (Rao, 1973, p. 42):

$$X = H\Lambda^{1/2}G', \tag{1.4}$$

where $H$ is an $n \times p$ matrix of the coordinates of the observations
along the principal axes of $X$, standardized in the sense that $H'H = I$. The
matrix $\Lambda$ is diagonal with eigenvalues
$\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_p$, and $G$ is the
$p \times p$ matrix of eigenvectors $g_i$ satisfying $X'X = G\Lambda G'$ and
$G'G = I$.

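From (1.1) and (1.4) consider the canonical model

$$y = H\Lambda^{1/2}G'\beta + u = H\Lambda^{1/2}\gamma + u, \tag{1.5}$$

which defines a parameter vector $\gamma = G'\beta$. The OLS estimate of
$\gamma$ is denoted by $c^0$. Under the normality assumption for $u$ we have

$$c^0 \sim N_p[\gamma, \sigma^2\Lambda^{-1}], \tag{1.6}$$

a $p$-variate normal variable. Denote the elements of $c^0$ and $\gamma$ by
$c_i^0$ and $\gamma_i$ respectively. For the canonical model (1.5) we have
$b^{0\prime}b^0 = c^{0\prime}G'Gc^0 = c^{0\prime}c^0$. Now transform $c^0$ to
$C$ where

$$\sigma^{-1}\Lambda^{1/2}c^0 = C \sim N_p(\gamma^*, I), \tag{1.7}$$

where $\gamma^* = \sigma^{-1}\Lambda^{1/2}\gamma$. This follows from (1.6).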
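
The notation of this section can be checked numerically. The sketch below uses
simulated data (the design, coefficients, and $k = 0.2$ are arbitrary
assumptions, not the cement data) and verifies that the ridge estimator (1.2)
equals $G$ times the componentwise shrunken canonical estimate, with shrinkage
factors $\delta_i = \lambda_i/(\lambda_i + k)$ implied by (1.2) and (1.4).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 30, 3, 0.2
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)
X = X / np.sqrt((X**2).sum(axis=0))       # unit-length columns: X'X is a correlation matrix
y = X @ np.array([1.0, -0.5, 0.25]) + rng.normal(scale=0.1, size=n)
y = y - y.mean()

# Singular value decomposition X = H Lambda^{1/2} G' as in (1.4)
H, sv, Gt = np.linalg.svd(X, full_matrices=False)
G, lam = Gt.T, sv**2                      # eigenvectors and eigenvalues of X'X

b_ols = np.linalg.solve(X.T @ X, X.T @ y)
b_ridge = np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)   # (1.2)

resid = y - X @ b_ols
s2 = resid @ resid / (n - p - 1)          # (1.3)

c0 = G.T @ b_ols                          # OLS estimate of gamma = G' beta
C = np.sqrt(lam) * c0 / np.sqrt(s2)       # standardized variable of (1.7), with s for sigma
delta = lam / (lam + k)                   # shrinkage factors implied by (1.2)
b_ridge_canonical = G @ (delta * c0)      # ridge estimate rebuilt from the canonical form
```

Replacing $\sigma$ by $s$ when forming `C` is a practical substitution made
here for illustration; the text treats $\sigma$ as known at this stage.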







H. D. VINOD 



2. A PROPERTY OF THE NORMAL VARIABLE 

Let $x \sim N(\xi, 1)$. The usual unbiased estimator of $\xi$ is $x$. The
shrinkage estimator of $\xi$ is $\delta x$, where $0 < \delta \le 1$ is a
shrinkage factor. Stein (1981) discussed shrinkage in terms of $x + f(x)$,
where $f$ is an almost arbitrary function. We choose $f(x) = (\delta - 1)x$.
Straightforward calculation gives the MSE of $\delta x$ as follows:

$$\begin{aligned}
E(\delta x - \xi)^2 &= E(\delta^2 x^2 + \xi^2 - 2\delta x\xi) \\
                    &= \delta^2(1 + \xi^2) + \xi^2 - 2\delta\xi^2 \\
                    &= \delta^2 + (\delta - 1)^2\xi^2.
\end{aligned} \tag{2.1}$$

An unbiased estimate of the MSE of $\delta x$, denoted by UMSE($\delta x$), is
obtained by substituting $x^2 - 1$ as the unbiased estimate of $\xi^2$ in
(2.1). Since $\xi^2$ is non-negative, whereas $x^2 - 1$ can be negative, we
consider only a non-negative (positive part) estimate $\max[0, x^2 - 1]$,
giving us two equivalent forms:

$$\mathrm{UMSE}(\delta x) = \delta^2 + (\delta - 1)^2 \max[0, x^2 - 1], \tag{2.2a}$$

and

$$\mathrm{UMSE}(\delta x) = \max[\delta^2, (\delta - 1)^2 x^2 + 2\delta - 1]. \tag{2.2b}$$

It can be shown following Baranchik (1970) that using a positive part
estimate reduces the MSE of the estimator.
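
The equivalence of the two forms and the near-unbiasedness of the UMSE can be
illustrated numerically. In this sketch the function names `umse_a`, `umse_b`,
and `mse`, and the values $\xi = 2$ and $\delta = 0.6$, are arbitrary choices
made for illustration.

```python
import numpy as np

def umse_a(x, delta):
    """Form (2.2a): delta^2 + (delta - 1)^2 * max(0, x^2 - 1)."""
    return delta**2 + (delta - 1.0)**2 * np.maximum(0.0, x**2 - 1.0)

def umse_b(x, delta):
    """Form (2.2b): max(delta^2, (delta - 1)^2 x^2 + 2 delta - 1)."""
    return np.maximum(delta**2, (delta - 1.0)**2 * x**2 + 2.0 * delta - 1.0)

def mse(xi, delta):
    """Exact MSE (2.1) of delta*x when x ~ N(xi, 1)."""
    return delta**2 + (delta - 1.0)**2 * xi**2

# The two forms agree on a grid of x values for several shrinkage factors.
x = np.linspace(-4.0, 4.0, 81)
same = all(np.allclose(umse_a(x, d), umse_b(x, d)) for d in (0.0, 0.3, 0.5, 0.9, 1.0))

# Monte Carlo comparison of the average UMSE with the exact MSE.
rng = np.random.default_rng(0)
xi, delta = 2.0, 0.6
draws = rng.normal(loc=xi, size=200_000)
mc_mse = np.mean((delta * draws - xi) ** 2)
mc_umse = np.mean(umse_b(draws, delta))
```

Because of the positive-part truncation, the average of the UMSE draws sits
slightly above the exact MSE rather than matching it exactly.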

Stein (1981) proposed the following ingenious device for obtaining a
confidence interval for the biased estimator $\delta x$. He suggested that we
consider the expectation of the squared difference between UMSE($\delta x$)
and the squared error $(\delta x - \xi)^2$. The motivation for the squared
error will become clear when we define the confidence intervals in (2.9) and
(2.10). In the absence of the truncation implied by the max function, we can
derive this expectation by direct expansion:

$$\begin{aligned}
E[\mathrm{UMSE}(\delta x) - (\delta x - \xi)^2]^2
  &= 8\delta^2 - 8\delta + 2 + 4(\delta - 1)^2\xi^2 \\
  &= 2(2\delta - 1)^2 + 4(\delta - 1)^2\xi^2,
\end{aligned} \tag{2.3}$$

where we use the relations

$$E(x^4) = 3 + 6\xi^2 + \xi^4, \qquad E(x^3) = 3\xi + \xi^3, \tag{2.4}$$

and note that the terms involving $\xi^4$ cancel. An alternative derivation
based on Stein (1981) involves integration by parts, which is applicable for
the more general case when $\delta$ is stochastic.
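
One way to organize the expansion behind (2.3) is to write $x = \xi + z$ with
$z \sim N(0, 1)$. The auxiliary variable $z$ and the shorthand
$U = (\delta - 1)^2 x^2 + 2\delta - 1$ (the untruncated UMSE) and
$M = (\delta x - \xi)^2$ are introduced here only for this sketch:

```latex
\begin{aligned}
U - M &= (\delta-1)^2(\xi+z)^2 + 2\delta - 1 - \{(\delta-1)\xi + \delta z\}^2 \\
      &= -2(\delta-1)\,\xi z - (2\delta-1)(z^2-1), \\[4pt]
E(U-M)^2 &= 4(\delta-1)^2\xi^2\,E(z^2) + (2\delta-1)^2\,E(z^2-1)^2 \\
         &= 4(\delta-1)^2\xi^2 + 2(2\delta-1)^2 .
\end{aligned}
```

The cross term vanishes because $E[z(z^2 - 1)] = 0$, and $E(z^2 - 1)^2 = 2$
for a unit normal $z$.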
An unbiased estimate of the right-hand side of (2.3) is given by replacing
$\xi^2$ by $(x^2 - 1)$. Our modification lies in the suggestion that the
non-negative number $\xi^2$ be estimated by $\max[0, x^2 - 1]$. This yields

$$\mathrm{Var} = \max[2(2\delta - 1)^2, \; 4\delta^2 - 2 + 4(\delta - 1)^2 x^2] \tag{2.5}$$

as an estimate of $E[\mathrm{UMSE}(\delta x) - (\delta x - \xi)^2]^2$, which
is almost unbiased.

Stein (1981) approximated a multivariate analogue of

$$\frac{[(\delta - 1)^2 x^2 + 2\delta - 1 - (\delta x - \xi)^2]^2}{4\delta^2 - 2 + 4(\delta - 1)^2 x^2} \tag{2.6}$$

as a (central) chi-square variable with one degree of freedom, $\chi_1^2$.
Stein’s approximation is obviously correct when $\delta = 1$. Note that for
$\delta = 1$ the ratio in (2.6) is simply $[1 - (x - \xi)^2]^2/2$. It can be
verified by simulating a large number of empirical cumulative densities that
the square root of this (+ or -) may be approximated by a unit normal
$N(0, 1)$ variable. For the $\chi_1^2$ approximation, it is imperative that
the square root of (2.6) be a real finite number for all real finite values of
$x$ and $\xi$, and for all $|\delta| \le 1$.

Stein’s approximation can be poor when $\delta \ne 1$. For example, when
$\delta = 0$, (2.6) becomes $(x^2 - 1 - \xi^2)^2/(4x^2 - 2)$, which cannot be
correctly approximated by a $\chi_1^2$ variable. Its square root cannot be a
unit normal variable. Note that $(4x^2 - 2)^{1/2}$ is imaginary for
$x^2 < 1/2$. When $(4x^2 - 2) = 0$ and $\delta = 0$ the expression (2.6)
becomes infinitely large.

Consider the case when $\delta = 1/2$, for which the square root of (2.6)
becomes

$$\frac{\xi(x - \xi)}{(x^2 - 1)^{1/2}}. \tag{2.7}$$

Now the numerator $\xi(x - \xi)$ is a normal variable with mean zero and
variance $\xi^2$. The term $(x^2 - 1)$ in the denominator of (2.7) is simply
an unbiased estimate of $\xi^2$. Note that when $\delta = 1/2$, Var given by
(2.5) equals $\max(0, x^2 - 1)$, which is zero for $x^2 < 1$. Hence it is
obvious that we should impose a positive lower bound for Var. We may think of
$(\delta - 1)^2 x^2 + 2\delta - 1$ as a non-central chi-square variable with
one degree of freedom and note that, if a normal approximation is desired, its
variance is unity; see Johnson and Kotz (1970, Ch. 28, eq. (23.2), p. 140).
This suggests that $\mathrm{Var} \ge 1$ is appropriate, which is confirmed by
considerable experimentation with simulated values of the ratio in (2.6).
Hence we define an almost unbiased estimate as follows:

$$SD^2 = \max[1, \; 2(2\delta - 1)^2, \; 4\delta^2 - 2 + 4(\delta - 1)^2 x^2]. \tag{2.8}$$

Further explanation of, and motivation behind, the lower bound 1 in (2.8)
may be obtained by regarding UMSE($\delta x$) as a singly truncated normal
variable; see Johnson and Kotz (1970, Table II, column 6, p. 86 of Chapter 13).
Now we write our modification of (2.6) as follows:

$$[\mathrm{UMSE}(\delta x) - (\delta x - \xi)^2]^2/(SD^2), \tag{2.9}$$

where UMSE($\delta x$) and SD are given by (2.2b) and (2.8) respectively.
Unlike (2.6), the square root of this ratio is a real and finite number for
all $|\delta| \le 1$, and for all $x$ and $\xi$ values.

As a further refinement one can improve the $\chi_1^2$ approximation by
considering a simulated empirical distribution function of (2.6) or (2.9)
under the null hypothesis $\xi = 0$ for each $\delta$. Since computer
generation of unit normals is straightforward, one can reach any desired
improvement over $\chi_1^2$ by considering a large enough simulation.

Now we develop the confidence intervals based on (2.9). For brevity
denote $U = \mathrm{UMSE}(\delta x)$, $M = (\delta x - \xi)^2$, $S = SD$, and
write the probability $\Pr[(U - M)^2/S^2 \le w^2]$ as follows:

$$\begin{aligned}
\Pr[\,|U - M| \le wS\,]
  &= \Pr[U - wS \le M \le U + wS] \\
  &= \Pr[(U - wS)^{1/2} \le |\delta x - \xi| \le (U + wS)^{1/2}] \\
  &= \Pr[\delta x - (U + wS)^{1/2} \le \xi \le \delta x + (U + wS)^{1/2}].
\end{aligned} \tag{2.10}$$

This shows that, if $w$ is known, we can determine a confidence interval for
$\xi$ centered at $\delta x$. If $\delta = 1$ the shrinkage estimator
$\delta x$ equals the maximum likelihood estimator $x$, and

$$\Pr[x - 1.96 \le \xi \le x + 1.96] = 0.95 \tag{2.11}$$

defines the usual confidence interval. If $\delta = 1$, we have $U = 1$ and
$S^2 = 2$. Hence we can determine $w$ so that $(U + wS)^{1/2} = 1.96$. We find
that $w = 2.0092$ yields the correct 95% confidence interval for
$\delta = 1$. Using this choice of $w$ we write the upper and lower bounds for
our UMSE intervals as:

$$\begin{aligned}
UP(x, \delta) &= \delta x + (U + 2.0092\,S)^{1/2}, \\
LO(x, \delta) &= \delta x - (U + 2.0092\,S)^{1/2},
\end{aligned} \tag{2.12}$$

where

$$U = \max(\delta^2, \; (\delta - 1)^2 x^2 + 2\delta - 1)$$

and

$$S = [\max(1, \; 2(2\delta - 1)^2, \; 4\delta^2 - 2 + 4(\delta - 1)^2 x^2)]^{1/2}.$$
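
The bounds in (2.12) translate directly into code. The following is a minimal
sketch; the function name `umse_interval` and the example inputs are
illustrative assumptions.

```python
import numpy as np

W = 2.0092  # constant chosen in the text so that delta = 1 gives x +/- 1.96

def umse_interval(x, delta, w=W):
    """Return (LO, UP) of (2.12) for x ~ N(xi, 1) and shrinkage factor delta."""
    u = max(delta**2, (delta - 1.0)**2 * x**2 + 2.0 * delta - 1.0)
    s = np.sqrt(max(1.0,
                    2.0 * (2.0 * delta - 1.0)**2,
                    4.0 * delta**2 - 2.0 + 4.0 * (delta - 1.0)**2 * x**2))
    half = np.sqrt(u + w * s)
    return delta * x - half, delta * x + half

lo1, up1 = umse_interval(1.5, 1.0)   # no shrinkage: reproduces (2.11)
lo2, up2 = umse_interval(3.0, 0.9)   # moderate shrinkage, interval centered at 2.7
```

With $\delta = 1$ the width is $2(1 + 2.0092\sqrt{2})^{1/2}$, which agrees
with 3.92 to about four decimal places.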
We may define the shrinkage factor $\delta$ leading to the estimator
$\delta x$ to be “too small” if the confidence interval centered at
$\delta x$ is wider than the normal confidence interval (2.11). Denote
$\delta^{SML}$ to be such that for all $0 < \delta < \delta^{SML}$

$$UP(x, \delta) - LO(x, \delta) > 3.92. \tag{2.13}$$

Table 1 reports $\delta^{SML}$ values correct to the nearest 0.001 for a set
of $x$ values. Table 1 also reports the so-called “ridiculously small”
shrinkage factors defined by

$$\delta^{RID} = 1 - 1.96\,|x|^{-1}. \tag{2.14}$$



Table 1. “Too Small” $\delta^{SML}$ (note 1) and “Ridiculously Small”
$\delta^{RID}$ (note 2) Shrinkage Factors for a Unit Normal Variable

      x     delta^RID   delta^SML   |      x     delta^RID   delta^SML
     1.2    NEG (3)       0.073     |     7.0      0.720       0.959
     1.5    NEG           0.328     |     8.0      0.755       0.969
     1.8    NEG           0.496     |     9.0      0.782       0.975
     1.9    NEG           0.539     |    10.0      0.804       0.980
     2.0    0.020         0.577     |    11.0      0.822       0.983
     2.2    0.100         0.641     |    12.0      0.837       0.986
     2.5    0.216         0.713     |    13.0      0.849       0.988
     2.8    0.300         0.766     |    14.0      0.860       0.989
     2.9    0.324         0.781     |    15.0      0.869       0.991
     3.0    0.347         0.794     |    16.0      0.877       0.992
     3.5    0.440         0.846     |    18.0      0.891       0.993
     4.0    0.510         0.880     |    20.0      0.902       0.994
     5.0    0.608         0.922     |    22.0      0.911       0.995
     6.0    0.673         0.945     |    25.0      0.922       0.996

(1) $\delta^{SML}$, “too small”, is defined from (2.13) so that the
confidence interval centered at $\delta x$ is wider than the usual (OLS)
interval (2.11).

(2) $\delta^{RID} = 1 - 1.96\,|x|^{-1}$ is “ridiculously small” because the
point estimate $\delta x$ lies outside the 95% interval for
$\delta < \delta^{RID}$.

(3) NEG means negative. Thus, $\delta = 0$ is not “ridiculously small” for
this $x = 1.2$. We have used the notation $C_i$ and $\delta_i$ in the
regression context, instead of $x$ and $\delta$ respectively.

Note that the point estimate $\delta x$ for all $0 < \delta < \delta^{RID}$
is outside the 95% interval centered at $x$. Avoidance of such extreme
shrinkage may be justified by “limited translation” arguments; see Efron and
Morris (1972). The interval (2.12) is referred to as a UMSE interval, and is
not intended to be used for the univariate case where the maximum likelihood
estimate is known to be admissible. In Section 3 we will use (2.12) to obtain
confidence intervals centered on a ridge regression estimator.
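
The two thresholds can be computed from (2.12) to (2.14). The sketch below is
one possible search rule, assumed here for illustration: $\delta^{SML}$ is
taken as the largest value on a grid of step 0.001 whose UMSE interval is
wider than 3.92. The function names are hypothetical.

```python
import numpy as np

def width(x, delta, w=2.0092):
    """Width UP - LO of the UMSE interval (2.12)."""
    u = max(delta**2, (delta - 1.0)**2 * x**2 + 2.0 * delta - 1.0)
    s = np.sqrt(max(1.0,
                    2.0 * (2.0 * delta - 1.0)**2,
                    4.0 * delta**2 - 2.0 + 4.0 * (delta - 1.0)**2 * x**2))
    return 2.0 * np.sqrt(u + w * s)

def delta_rid(x):
    """(2.14): below this value, delta*x falls outside x +/- 1.96."""
    return 1.0 - 1.96 / abs(x)

def delta_sml(x, grid=np.arange(0.001, 1.0, 0.001)):
    """Largest grid value of delta whose interval is wider than 3.92 (assumed rule)."""
    wide = [d for d in grid if width(x, d) > 3.92]
    return max(wide) if wide else None

rid_2 = delta_rid(2.0)      # positive, so some shrinkage is "ridiculously small"
rid_12 = delta_rid(1.2)     # negative: the NEG entries of Table 1
sml_12 = delta_sml(1.2)
```

The grid search is adequate for three-decimal accuracy; a root finder could be
substituted if finer resolution were needed.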

3. CONFIDENCE REGION FOR REGRESSION PARAMETERS 

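For the canonical model (1.5) the standardized parameter is
$\gamma_i^* = \sigma^{-1}\lambda_i^{1/2}\gamma_i$. It is estimated by

$$\delta_i C_i = \delta_i\,\sigma^{-1}\lambda_i^{1/2}c_i^0. \tag{3.1}$$

For each $i$ we let $C_i$ and $\gamma_i^*$ be equivalent to $x$ and $\xi$
respectively of Section 2 above, and rewrite the UMSE confidence interval from
(2.12) as follows:

$$LO(C_i, \delta_i) \le \gamma_i^* \le UP(C_i, \delta_i). \tag{3.2}$$

Now multiply all terms of (3.2) by $\sigma\lambda_i^{-1/2} > 0$ and define

$$\begin{aligned}
C_i^{UP} &= \sigma\lambda_i^{-1/2}\,UP(C_i, \delta_i) \\
         &= \delta_i c_i^0 + \sigma\lambda_i^{-1/2}(U^* + 2.0092\,S^*)^{1/2},
\end{aligned} \tag{3.3}$$

where

$$U^* = \max(\delta_i^2, \; (\delta_i - 1)^2\sigma^{-2}\lambda_i(c_i^0)^2 + 2\delta_i - 1)$$

and

$$S^* = [\max(1, \; 2(2\delta_i - 1)^2, \; 4\delta_i^2 - 2 + 4(\delta_i - 1)^2\sigma^{-2}\lambda_i(c_i^0)^2)]^{1/2}.$$

Similarly, define the lower bound of the UMSE interval as follows:

$$C_i^{LO} = \delta_i c_i^0 - \sigma\lambda_i^{-1/2}(U^* + 2.0092\,S^*)^{1/2}. \tag{3.4}$$

From (3.3) and (3.4) we have the approximate relation ($\alpha = 0.05$)

$$\Pr[C_i^{LO} \le \gamma_i \le C_i^{UP}] = 1 - \alpha \tag{3.5}$$

for all $i = 1, \ldots, p$. The probability of the joint event is
approximately

$$\Pr[C_i^{LO} \le \gamma_i \le C_i^{UP}, \; i = 1, \ldots, p] = (1 - \alpha)^p, \tag{3.6}$$

because, from (1.6), the $c_i^0$ are uncorrelated random variables.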
In addition to the uncorrelatedness of the $c_i^0$ it is important to note
that we select only one finite confidence interval from each dimension $i$.
This permits the confidence level to remain at the desired level.

From these intervals for $\gamma$ we generate conservative intervals for
$\beta$, the main parameter of interest in a regression model, as follows. By
definition, $\beta = G\gamma$ implies that $\beta_i = \sum_j g_{ij}\gamma_j$,
where $g_{ij}$ is the $(i,j)$th element of $G$. A confidence interval for
$\beta_i$ can be estimated as follows. We note that the confidence set is
convex, and that a pre-multiplication by a nonstochastic matrix $G$ involves
rotation of the axes. For locating the bounds with respect to the new axes we
need to consider the values at the corners of the convex set. Thus we have

$$b_i^{UP} = \sum_{j=1}^{p} \max(g_{ij}C_j^{UP}, \; g_{ij}C_j^{LO}) \tag{3.7}$$

and

$$b_i^{LO} = \sum_{j=1}^{p} \min(g_{ij}C_j^{UP}, \; g_{ij}C_j^{LO}), \tag{3.8}$$

forming a confidence interval for $\beta_i$ centered at the RR estimator. Note
that these are nonlinear equations, and the bounds are attainable. The
reduction in variance often achieved by RR is reflected in narrow intervals.
For $p > 6$ the intervals based on (3.7) and (3.8) can be too wide and Kabe’s
(1983) quadratic programming approach may be used to find confidence intervals
from Stein’s (1981, equation 8.6) confidence set. Otherwise it may be better
to evaluate $GC$ at, say, 10 equidistant values between the upper and lower
bounds of $C_i$ for each $i$, and find the max and min of each $b_i$ over the
ten evaluations. Then one can search for the corners of the convex set in its
neighbourhood.
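
The corner rule in (3.7) and (3.8) can be sketched as follows. The
$2 \times 2$ rotation matrix and the intervals for $\gamma$ are hypothetical
values chosen only to show the mechanics, and `beta_bounds` is an illustrative
name.

```python
import numpy as np

def beta_bounds(G, c_lo, c_up):
    """Bounds (3.7)-(3.8): sum over j of the max (min) of g_ij*C_j^UP and g_ij*C_j^LO."""
    a = G * c_up[None, :]          # element (i, j) is g_ij * C_j^UP
    b = G * c_lo[None, :]          # element (i, j) is g_ij * C_j^LO
    return np.minimum(a, b).sum(axis=1), np.maximum(a, b).sum(axis=1)

# Hypothetical rotation and intervals for gamma, for illustration only.
theta = np.pi / 6
G = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
c_lo = np.array([0.5, -0.2])
c_up = np.array([0.9,  0.4])
b_lo, b_up = beta_bounds(G, c_lo, c_up)

# Brute-force check: map the four corners of the box for gamma through G.
corners = np.array([[g0, g1] for g0 in (c_lo[0], c_up[0])
                             for g1 in (c_lo[1], c_up[1])])
images = corners @ G.T             # each row is G @ corner
```

Since each $\beta_i$ is linear in $\gamma$, its extremes over the box are
attained at corners, which is why the termwise max and min agree with the
brute-force search.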

Obenchain (1977) advocated the use of the usual confidence intervals
from normal regression theory (centered at $b^0$) for RR. We suggest that
this might be considered when our UMSE intervals are wider than the OLS
intervals, i.e., when the $\delta_i$ are “too small” or “ridiculously small”.
More general confidence sets for linear combinations of $\beta$ involving the
squared length of $C$ can be generated. Confidence intervals obtained for
stochastic $k$ (or $\delta_i$) are derived in the appendix; however,
derivatives of $\delta_i$ with respect to $C_i$ will appear in related
expressions. The appropriate correction for the fact that $s^2$ is a
stochastic estimate of $\sigma^2$ involves a slight loss of degrees of
freedom. This is discussed by Stein (1981) and explained by Vinod and Ullah
(1981, p. 162). These extensions seem to be complicated from a practical
viewpoint. Some of the theory is explained by Vinod (1980).

Berger (1980) suggested certain confidence ellipsoids for $\beta$. Morris
(1983, p. 52) commented on Berger’s ellipsoids and noted that the data analyst
needs intervals for each parameter to replace the familiar $t$ intervals, and
further suggested approximate empirical Bayes confidence intervals. Morris
also proposed approximate Bayes 95% confidence intervals based on

$$b_i^k \pm 1.96\,(V_{ii}^B)^{1/2}, \tag{3.9}$$

where the $V_{ii}^B$ are the diagonal elements of the posterior covariance
matrix:

$$V^B = V^0 - V^0(V^0 + V^P)^{-1}V^0. \tag{3.10}$$

For RR we have $V^0 = \sigma^2(X'X)^{-1}$ as the sample covariance and
$V^P = (\sigma^2/k)I_p$ as the prior covariance matrix from the Bayesian
interpretation of ridge regression; see Vinod and Ullah (1981, p. 186). Using
$X'X = G\Lambda G'$, (3.10) simplifies to $\sigma^2(X'X + kI)^{-1}$. Now the
standard errors for a computational shortcut discussed by Vinod and Ullah
(1981, p. 189) are precisely the ones needed for the approximate Bayes
intervals. Frequentist properties of the approximate Bayesian confidence
intervals are unknown for RR applications, although Morris (1983, p. 52)
mentioned encouraging simulation evidence for related applications.
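
The simplification of the posterior covariance matrix can be verified
numerically. The sketch below uses simulated data and an arbitrary $k = 0.5$;
it substitutes $s^2$ for $\sigma^2$ and centers the intervals at the ridge
estimate with a 1.96 multiplier, which are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 25, 3, 0.5
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)
y = X @ np.array([0.8, 0.0, -0.4]) + rng.normal(scale=0.5, size=n)
y = y - y.mean()

XtX = X.T @ X
b_ols = np.linalg.solve(XtX, X.T @ y)
resid = y - X @ b_ols
s2 = resid @ resid / (n - p - 1)                 # (1.3)

A_inv = np.linalg.inv(XtX + k * np.eye(p))
b_ridge = A_inv @ X.T @ y                        # (1.2)
V_bayes = s2 * A_inv                             # simplified posterior covariance
se = np.sqrt(np.diag(V_bayes))
lower, upper = b_ridge - 1.96 * se, b_ridge + 1.96 * se

# Same matrix from the unsimplified formula V0 - V0 (V0 + VP)^-1 V0
V0 = s2 * np.linalg.inv(XtX)
VP = (s2 / k) * np.eye(p)
V_check = V0 - V0 @ np.linalg.inv(V0 + VP) @ V0
```

The agreement of `V_bayes` and `V_check` is an algebraic identity, so it holds
for any positive $k$ and any full-rank design.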

4. AN EXAMPLE 

We illustrate the methodology discussed above using a well-known example
involving cement data given by Hald (1952, p. 647) having $n = 13$, $p = 4$.
The computations are given in Table 2. Instead of forcing $X'X$ to be the
correlation matrix, we use a similar rescaling of the data to force the sum of
squares for all variables to be 12. The first row of Table 2 gives our scale
factors for the four regressors. For $y$ the scale factor is 0.066473, and
$s^2$ is 0.026437. The eigenvalues of $X'X$ are given in the row marked
$\lambda_i$.

The OLS results with corresponding standard errors are given for this
parameterization in the rows marked $b_i^0$ and $SE_i$ respectively. The usual
95% confidence intervals using the Student’s $t$ value of 2.306 are given in
the rows marked $b_i^{UP}(t)$ and $b_i^{LO}(t)$. The estimated $R^2$ equals
0.9824 for these data.

The elements of the vectors $c^0 = G'b^0$, $C$, and the matrix $G$ are also
given. For UMSE, the computations of $C_i^{UP}$ and $C_i^{LO}$ are made for
$k = 0.187$, with corresponding $\delta_i$ shown in the row marked
$\delta_i$. This does have one “too small” shrinkage, namely
$\delta_1 = 0.993$, because $C_1 = 20.9$. For $C_i = 20$, Table 1 of Section 2
shows that $\delta_i < 0.994$ is “too small”. However, it is not
“ridiculously small”.

The choice $k = 0.187$ is non-stochastic. It implies a “stable” RIDGE
TRACE in terms of the so-called ISRM quantification of the stability
criterion, based on the multicollinearity allowance,
$m = p - \sum_i \delta_i$, discussed by Vinod (1976, p. 838). The solution
$k = 0.184$ corresponds to $m = 1$, and is also favored by other criteria as
discussed by Vinod (1976, 1978), Obenchain (1975, 1978), and Mallows (1973).
It is close to Hoerl et al.’s (1975) choice based on
$k = ps^2/(b^{0\prime}b^0) = 0.157$. For these data multicollinearity is a
serious problem, because $\lambda_p = 0.0195$ is considerably smaller than
$\lambda_1$.

Table 2. Confidence Intervals for Cement Data

    i                                   1          2          3          4
    scale factors                    0.1700     0.0643     0.1561     0.0597
    lambda_i                        26.8284    18.9128     2.2393     0.0195
    b_i^0                            0.6065     0.5277     0.0434    -0.1603
    SE_i                             0.2912     0.7487     0.3213     0.7889
    b_i^UP(t)                        1.2780     2.2543     0.7843     1.6589
    b_i^LO(t)                       -0.0650    -1.1988    -0.6975    -1.9795
    c_i^0                            0.6570     0.0083     0.3028    -0.3880
    b_i^k                            0.5038     0.3069    -0.0642    -0.3911
    G matrix rows   (1)              0.4760    -0.5090     0.6755    -0.3911
                    (2)              0.5639     0.4139    -0.3144    -0.6418
                    (3)             -0.3941     0.6050     0.6377    -0.2685
                    (4)             -0.5479    -0.4512    -0.1954    -0.6767
    C_i                             20.9294     0.2220     2.7868    -0.3331
    delta_i                          0.9931     0.9902     0.9227     0.0941
    C_i^UP (UMSE)                    0.7142     0.0808     0.4805     1.7326
    C_i^LO (UMSE)                    0.5907    -0.0643     0.0783    -1.8056
    b_i^UP (UMSE)                    0.8021     1.4407     0.2630     0.9119
    b_i^LO (UMSE)                    0.2057    -0.8269    -0.3914    -1.6941
    Relative width b_i (UMSE/OLS)    0.44       0.66       0.44       0.72

The usual confidence intervals $b_i^{UP}(t)$ and $b_i^{LO}(t)$ are seen to be
wider and perhaps less informative than our $b_i^{UP}$ and $b_i^{LO}$ for
UMSE. Our nonlinear transformation (3.7) and (3.8) seems to have successfully
yielded narrow intervals for $\beta_i$ from the narrow intervals for
$\gamma_i$. The last row of Table 2 highlights the dramatic reduction in the
width of OLS intervals by using UMSE.
5. SIMULATION TO EVALUATE APPROXIMATIONS

Because of some intrinsic difficulties associated with confidence intervals
for biased estimators, we had to resort to certain approximations in deriving
our UMSE intervals. For $\delta_i = 1$ these intervals coincide with the OLS
intervals, and no approximation is needed. However, for $0 < \delta_i < 1$ it
is not obvious whether our approximations will yield reliable confidence
coefficients. To obtain a better understanding of the properties of our
confidence intervals we have designed a simulation study as follows. The
simulation involves several steps which are consecutively numbered for
convenience.

(Step 1) Choose the basic data structure of the cement data of the previous
section. This involves using the same eigenvalues and eigenvectors.

(Step 2) Choose some arbitrary values for the true unknown $\gamma_i$. Our
four sets of four $\gamma_i$ values are respectively: (.12, .28, .56, -.77),
(1.01, -.55, -1.02, 1.27), (.80, -2.09, 4.26, -1.35), and (5.68, .27, 6.10,
5.52). This choice implies four signal-to-noise ratios (SNR): 1, 4, 25, and
100, where SNR $= \gamma'\gamma/\sigma^2 = \beta'\beta/\sigma^2$.

(Step 3) The corresponding sets of true unknown $\beta_i$ are found by the
definitional relation $\beta = G\gamma$.

(Step 4) Specify $\sigma^2 = 1$, and the non-stochastic $\delta_i$ values
from the previous section: 0.9931, 0.9902, 0.9227 and 0.0941 respectively.

(Step 5) Initialize the computer counters for measuring the number of times
the estimated intervals for $\gamma_i$ and $\beta_i$ cover the respective true
unknowns. Also, initialize the counters for measuring the widths of various
intervals.

(Step 6) Use a “super duper” generator for normal random numbers denoted by
$N(0, 1)$, based on a computer program developed by Marsaglia and others at
McGill University in Canada. Obtain the OLS estimates of $\gamma_i$ by the
relation:

$$c_i^0 = \gamma_i + N(0, 1)\,\sigma\lambda_i^{-1/2}. \tag{5.1}$$

(Step 7) Obtain UMSE($\delta x$) of (2.2b) for $\delta_i C_i$ as an estimator
of $\gamma_i^*$.

(Step 8) Obtain the UMSE confidence intervals from (2.12). Obtain OLS
intervals by substituting $\delta_i = 1$ in (2.12).

(Step 9) Compute the absolute values of the widths of OLS and UMSE intervals
for $\gamma_i$. Verify that the OLS intervals are
$2\sigma\lambda_i^{-1/2}(1.96)$ wide.

(Step 10) Find the OLS and UMSE intervals for $\beta_i$ by using the
relations (3.7) and (3.8).

(Step 11) Compute the four widths of the intervals for both $\beta_i$ and
$\gamma_i$, and for each of the two types of intervals considered here, namely
the OLS and UMSE. These are given in Table 3. Now store the appropriate
widths. Determine whether the observed intervals based on the current
realization of the normal random numbers cover the true $\beta_i$ (and
$\gamma_i$). If the coverage is observed, increase the appropriate counters
by 1.

(Step 12) Repeat Steps 5 to 11 for a total of 1000 realizations. In our
simulation the coverage of $\beta_i$ was achieved in more than 950 cases for
each $i$, and each SNR.

(Step 13) Find the coverage proportions by dividing the coverage counts by
1000, and the average widths in a similar fashion.

(Step 14) Compute an “overall” coverage proportion, as the fourth root of the
product of four coverage proportions associated with the four regressors. This
is to be done four times, i.e., for both $\gamma_i$ and $\beta_i$, and for OLS
and UMSE.

(Step 15) Define the “relative width” of UMSE as the ratio of the UMSE width
to the corresponding OLS width. The “overall” width of UMSE is then computed
as the average of the “relative widths” over the set of the four regressors.
This is to be done four times, as before.

(Step 16) Summarize the results in a table.

(Step 17) Repeat Steps 4 to 16 for alternative choices of $\delta_i$.
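
The steps above can be sketched for the canonical parameters $\gamma_i$ as
follows. This is a simplified illustration, not the original program: it uses
NumPy's default generator instead of the generator named in Step 6, covers
only the first set of $\gamma_i$ values (SNR = 1), and omits the
transformation to $\beta_i$. The eigenvalues and shrinkage factors are those
reported in Table 2.

```python
import numpy as np

W = 2.0092
lam = np.array([26.8284, 18.9128, 2.2393, 0.0195])    # eigenvalues from Table 2
delta = np.array([0.9931, 0.9902, 0.9227, 0.0941])    # shrinkage factors from Table 2
gamma = np.array([0.12, 0.28, 0.56, -0.77])           # first set in Step 2 (SNR = 1)
sigma = 1.0

def half_width(C, d, w=W):
    """Half-width of the UMSE interval (2.12), vectorized over coefficients."""
    u = np.maximum(d**2, (d - 1.0)**2 * C**2 + 2.0 * d - 1.0)
    s = np.sqrt(np.maximum.reduce([np.ones_like(C),
                                   2.0 * (2.0 * d - 1.0)**2 * np.ones_like(C),
                                   4.0 * d**2 - 2.0 + 4.0 * (d - 1.0)**2 * C**2]))
    return np.sqrt(u + w * s)

rng = np.random.default_rng(12345)     # assumed seed, for reproducibility only
reps = 1000
scale = sigma / np.sqrt(lam)           # standard deviation of each c_i^0
cover = np.zeros(4)
width_sum = np.zeros(4)
for _ in range(reps):
    c0 = gamma + rng.standard_normal(4) * scale       # (5.1)
    C = c0 / scale                                    # standardized, unit variance
    hw = scale * half_width(C, delta)                 # back on the gamma scale
    centre = delta * c0
    cover += (centre - hw <= gamma) & (gamma <= centre + hw)
    width_sum += 2.0 * hw

coverage = cover / reps                               # Step 13
overall_coverage = coverage.prod() ** 0.25            # Step 14
ols_width = 2.0 * 1.96 * scale                        # Step 9
relative_width = (width_sum / reps) / ols_width       # Step 15
overall_width = relative_width.mean()
```

Results will vary with the seed and generator, so they should be compared with
the published tables only in broad terms.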



Table 3. Average Width for the ith Coefficient Over 1000 Problems

                    GAMMA PARAMETER            BETA PARAMETER
    SNR    i        OLS          UMSE          OLS         UMSE
      1    1       .756812      .751577       9.35928     8.64175
           2       .901381      .892573      19.64680    18.04290
           3      2.619580     2.421320      10.05430     9.27819
           4     28.082500    25.690900      20.33670    18.67270

      4    1       .756812      .752066       9.35928     8.73603
           2       .901381      .892761      19.64680    18.27390
           3      2.619580     2.434470      10.05430     9.38171
           4     28.082500    26.043700      20.33670    18.91440

     25    1       .756812      .751881       9.35928     8.94112
           2       .901381      .896098      19.64680    18.36500
           3      2.619580     2.739000      10.05430     9.57538
           4     28.082500    26.034500      20.33670    18.96900

    100    1       .756812      .767028       9.35928     9.57562
           2       .901381      .892562      19.64680    19.64820
           3      2.619580     3.007520      10.05430    10.24900
           4     28.082500    27.891400      20.33670    20.28480

NOTE: See Steps 9 to 11 described in the text.

Table 4 gives a summary of the above simulation. The simulation does
support the theory of the previous sections, and indicates that the new
confidence intervals can achieve narrower widths without sacrificing the
overall confidence levels. The coverage proportions (unknown to the
researcher) are always larger than 95% and the averages of the “relative
widths” are rarely much larger than unity. Since the width is known to the
researcher, using our methods cannot mislead anyone, even if the width turns
out to be slightly larger. The slightly larger width for the UMSE interval
when SNR is 100 in Table 4 indicates that RR may not be better than OLS in
these cases. This is consistent with recent ridge literature reviewed in
Chapters 7 and 8 of Vinod and Ullah (1981), and shows that UMSE intervals can
discourage the use of RR in unfavorable cases. Thus wider UMSE intervals do
not necessarily represent a disadvantage.

Table 4. Confidence Interval Widths and Coverage:
Simulation Over 1000 Problems

                          Confidence Level (2)     Relative Width (3)
    SNR (1)  Parameter      OLS        UMSE              UMSE
       1     gamma          0.94       0.96              0.96
             beta           0.96       1.00              0.92
       4     gamma          0.94       0.96              0.96
             beta           0.97       1.00              0.93
      25     gamma          0.94       0.96              0.99
             beta           0.97       1.00              0.94
     100     gamma          0.95       0.96              1.04
             beta           0.98       1.00              1.01

(1) Signal-to-noise ratio is $\gamma'\gamma/\sigma^2$.

(2) Confidence level is the fourth root of a product of four coverage
proportions for four regressors over 1000 problems.

(3) Relative width is relative to the average width of OLS intervals over
1000 problems.

One can imagine a practical user of our methods trying both OLS and,
say, UMSE and then using the narrower of the two confidence intervals. Since
the theoretical properties of such interdependent actions appear to be
intractable, we discourage this approach. When the new confidence intervals
are wider, RR need not be rejected completely. If ridge results are reported,
both types of confidence statements should be reported. Note, for example,
that a traditional user of standard normal tables may correctly consider
many 95% intervals: (-2.0, 1.92), (-1.96, 1.96), (-1.88, 2.06), etc.

An optimistic assessment of our methods based on a simulation continues
to hold true for additional choices of $\delta_i$ not reported here. Whenever
the $\delta_i$ arise from a realistic problem they are rarely much smaller
than the corresponding “too small” values $\delta^{SML}$; and therefore UMSE
intervals are usually narrower than those for OLS. For artificial problems
having $\delta_i$ below such threshold values we recommend using only the OLS
intervals for all coefficients.

It is interesting to note that the coverage proportions and confidence
levels are usually smaller for the canonical model having $\gamma_i$ as the
unknowns, compared to the original model having $\beta_i$ as the unknowns.
This observation is reassuring to the reader who may be uncomfortable with the
nonlinear transformation of equations (3.7) and (3.8), because it shows that
these transformations are conservative.



6. QUALMS ABOUT BOOTSTRAP CONFIDENCE 
INTERVALS FOR RIDGE REGRESSION 

The main purpose of this section is to show that ridge regression
represents a case where one may have “qualms” (Schenker, 1985) about bootstrap
confidence intervals. Since ridge regression yields biased estimators, the
bootstrap assumption about the existence of a pivotal quantity whose
distribution does not depend on unknown parameters is not satisfied. First, we
will briefly review the basic bootstrap ideas in the regression context. Next,
we apply them to ridge regression illustrated by the cement data example
studied before. We will also compare the results with approximate Bayes
results for the same data.

The residuals from the OLS estimator, denoted by $e = y - Xb^0$, play a
special role in bootstrapping. The covariance matrix of $b^0$ is

$$\mathrm{Cov}(b^0) = s^2(X'X)^{-1}, \qquad s^2 = (e'e)/(n - p - 1). \tag{6.1}$$

An empirical CDF of OLS residuals puts probability mass $1/n$ at each $e_t$,
and is denoted here by $F_e$. Now a basic bootstrapping idea is to use $F_e$,
with mean zero and variance $\sigma_*^2$, as a feasible, approximate,
nonparametric estimate of the CDF of the true unknown errors, denoted by
$F_u$. Let $J$ be a suitably large number, which for the cement data is 500.
We draw $J$ sets of bootstrap samples of size $n$, which is 13 for the cement
data, with elements denoted by $e_{*jt}$ ($j = 1, 2, \ldots, J$ and
$t = 1, \ldots, n$) from $F_e$ using random sampling with replacement. One
generates $J$ sets of $n \times 1$ vectors denoted by $e_{*j}$ having elements
$e_{*jt}$ ($t = 1, \ldots, n$). Hence the pseudo $y$ data are obtained by:

$$y_{*j} = Xb^0 + e_{*j}, \qquad j = 1, 2, \ldots, J, \tag{6.2}$$

which yields a large number, $J$, of regression problems to be used for
bootstrap inference described below. We ensure that the probability that
$e_{*jt}$ equals $e_t$ is $1/n$. Hence the variance is:

$$\sigma_*^2 = E(e_{*jt}^2) = \sum_{t=1}^{n}(1/n)\,\mathrm{var}(e_t)
  = (1/n)\sum_{t=1}^{n} e_t^2 = s^2(n - p)/n. \tag{6.3}$$

This shows that when the residuals are rescaled by the square root of
$n/(n - p)$, the variance of the rescaled $e_{*jt}$ should be approximately
equal to $s^2$. For Hald’s cement data with $n = 13$ and $J = 500$, the
variance (6.3) is computed to be 0.026512, which is very close to the $s^2$ of
0.026437; this suggests that $J = 500$ is adequate.
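
The resampling scheme of (6.2) and (6.3) can be sketched as follows. The
design matrix and responses are simulated stand-ins (the cement data are not
reproduced here), and no recentering of the residuals is performed.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, J = 20, 3, 500
X = rng.normal(size=(n, p))                       # hypothetical regressors
y = X @ np.array([0.6, 0.5, -0.2]) + rng.normal(scale=0.2, size=n)

b0 = np.linalg.solve(X.T @ X, X.T @ y)            # OLS estimate
e = y - X @ b0                                    # OLS residuals
e_rescaled = e * np.sqrt(n / (n - p))             # rescaling suggested by (6.3)

idx = rng.integers(0, n, size=(J, n))             # each residual drawn with mass 1/n
e_star = e_rescaled[idx]                          # J bootstrap residual vectors
y_star = X @ b0 + e_star                          # pseudo data (6.2), shape (J, n)

var_star = np.mean(e_star**2)                     # empirical counterpart of (6.3)
```

Comparing `var_star` with the mean of the squared rescaled residuals gives a
quick check that $J$ is large enough, in the spirit of the comparison reported
above.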

Efron (1982, p. 36) states that (6.2) is a “standard linear model written
in unusual notation”. If one applies OLS to the pseudo $y$ data one obtains:

$$b_{*j} = (X'X)^{-1}X'y_{*j} \quad \text{and} \quad \mathrm{Cov}(b_{*j}) = \sigma_e^2 (X'X)^{-1} \tag{6.4}$$

for $j = 1, \ldots, J$. This expression for the covariance matrix is the same as
$\mathrm{Cov}(b^0)$ provided we use rescaled residuals based on (6.3). In other words,
the standard errors for OLS regression coefficients calculated by the bootstrap
and the conventional methods are almost identical.

Let the empirical CDF of the $J$ estimates $b_{*j}$ be denoted by
$\mathrm{CDF}_b(z) = \#(b_{*j} \le z)/J$, where $\#$ denotes the number of times the
condition is observed. For given $\alpha$ between 0 and 0.5 (without loss of
generality) we define:

$$b_{Lo}(\alpha) = \mathrm{ICDF}_b(\alpha), \quad \text{and} \quad b_{Up}(\alpha) = \mathrm{ICDF}_b(1 - \alpha), \tag{6.5}$$

where $\mathrm{ICDF}_b(z)$ denotes the inverse of $\mathrm{CDF}_b(z)$. Now we can write the
probability:

$$\mathrm{Prob}_*[\,b_{Lo}(\alpha) \le b_{*j} \le b_{Up}(\alpha)\,] = 1 - 2\alpha. \tag{6.6}$$

This can yield a confidence interval for $\beta$ provided we can replace $b_{*j}$ by $\beta$
as an approximation.

Since the above approximation may not be good, Efron suggests the
following refinement. Its basic idea is the assumption that $b^0 - \beta$ and
$b_{*j} - b^0$ have approximately the same distribution, and that this distribution is
symmetric about the origin. To visualize why this is a refinement, note that

$$b^0 - \beta = (X'X)^{-1}X'y - \beta = (X'X)^{-1}X'u = Wu, \tag{6.7}$$

where $W = (X'X)^{-1}X'$ is a known matrix. Since (6.7) involves the
unknowns $u$, it is approximated by:

$$\begin{aligned}
b_{*j} - b^0 &= (X'X)^{-1}X'y_{*j} - b^0 \\
&= (X'X)^{-1}X'(Xb^0 + e_{*j}) - b^0, \quad \text{from (6.2)} \\
&= (X'X)^{-1}X'e_{*j} = We_{*j}.
\end{aligned} \tag{6.8}$$

Note that $We_{*j}$ can be reasonably expected to approximate $Wu$. To
obtain the left hand side of (6.8) we subtract $b^0$ from each term of (6.6) and
write

$$\mathrm{Prob}_*[\,b_{Lo}(\alpha) - b^0 \le b_{*j} - b^0 \le b_{Up}(\alpha) - b^0\,] = 1 - 2\alpha. \tag{6.9}$$

Now we replace $b_{*j} - b^0$ in the middle term of (6.9) by $b^0 - \beta$, the left
hand side of (6.7), which is appropriate because the right hand side of (6.7)
can be approximated by the right hand side of (6.8). Solving the two
inequalities for $\beta$ reverses their order. This manipulation may
be called a “reflection” of the confidence interval (6.6) through $b^0$. Thus we
write

$$\mathrm{Prob}_*[\,2b^0 - b_{Up}(\alpha) \le \beta \le 2b^0 - b_{Lo}(\alpha)\,] = 1 - 2\alpha. \tag{6.10}$$
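The percentile limits of (6.5) and the reflection of (6.10) amount to a few lines of arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and `np.quantile` uses linear interpolation by default, which is a different convention from averaging two adjacent ranked values and may give slightly different limits.

```python
import numpy as np

def percentile_interval(draws, alpha):
    """Equal-tailed limits (b_Lo, b_Up) read off the bootstrap draws."""
    lo, up = np.quantile(draws, [alpha, 1.0 - alpha])
    return float(lo), float(up)

def reflected_interval(b_lo, b_up, center):
    """Reflect (b_lo, b_up) through `center`: (2c - b_up, 2c - b_lo)."""
    return 2.0 * center - b_up, 2.0 * center - b_lo

# Cement-data figures for the first coefficient, as quoted in this section:
# quantile limits 0.3730 and 0.6365, reflected through the estimate 0.5038.
lo, up = reflected_interval(0.3730, 0.6365, center=0.5038)
print(round(lo, 4), round(up, 4))
```

The reflected interval has the same width as the percentile interval; only its position relative to the point estimate changes.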

Efron's (1982, ch. 10) bias corrected percentile method makes the same
“reflection” after a hypothesized standardization to $N(0, 1)$, a unit normally
distributed pivotal quantity denoted by $\theta$. Let $\mathrm{CDF}_\theta(\hat\theta)$ denote the CDF
based on bootstrap replications applied to the estimated pivotal quantity.
Now denote

$$z_0 = \Phi^{-1}[\mathrm{CDF}_\theta(\hat\theta)] \tag{6.11}$$

using the distribution function $\Phi$ of the standard normal variate.

The bias corrected percentile method leads to the following approximate
$1 - 2\alpha$ confidence interval:

$$\{\mathrm{ICDF}_\theta(\Phi[2z_0 - z_\alpha]),\ \mathrm{ICDF}_\theta(\Phi[2z_0 + z_\alpha])\}, \tag{6.12}$$

where $\mathrm{ICDF}_\theta$ is the inverse of $\mathrm{CDF}_\theta$, as before, and $z_\alpha$ is the upper $\alpha$ point
of the $N(0, 1)$ distribution, defined by $\Phi(z_\alpha) = 1 - \alpha$. If the bootstrap
distribution is median unbiased, so that $z_0 = 0$, the levels in (6.12) reduce
to $\alpha$ and $1 - \alpha$, and the bias correction makes no difference.
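The two probability levels entering (6.12) can be computed directly. The following is a minimal sketch using only the Python standard library; the function name is hypothetical, and the input 247/500 is the bootstrap CDF value that arises for the first cement-data coefficient discussed later in this section.

```python
from statistics import NormalDist

def bias_corrected_levels(cdf_at_estimate, alpha):
    """Return z0 and the two adjusted levels Phi(2 z0 -/+ z_alpha)."""
    phi = NormalDist()                        # standard normal distribution
    z0 = phi.inv_cdf(cdf_at_estimate)         # z0 = Phi^{-1}[CDF(estimate)]
    z_alpha = phi.inv_cdf(1.0 - alpha)        # upper alpha point of N(0, 1)
    lower = phi.cdf(2.0 * z0 - z_alpha)       # level for the lower limit
    upper = phi.cdf(2.0 * z0 + z_alpha)       # level for the upper limit
    return z0, lower, upper

z0, lower, upper = bias_corrected_levels(247 / 500, alpha=0.025)
print(round(z0, 3), round(lower, 4), round(upper, 4))
```

The limits themselves are then the bootstrap order statistics at these two levels; with `cdf_at_estimate = 0.5` the levels fall back to `alpha` and `1 - alpha`.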

Since bootstrapping does not give better parameter estimates (Peters
and Freedman, 1984), we do not use it to suggest yet another choice of $k$ in
ridge regression. The confidence interval procedure proposed in this paper
accepts the mean squared error (MSE) reducing choice of $k$ made by the
investigator and provides a confidence interval conditional on that choice.
For the cement data, $k = 0.187$ as in Section 4 above. Instead of $(b_{*j} - b^0)$
in (6.8) we have here

$$[(X'X + kI)^{-1}X'X - I]\,b^0 + [(X'X + kI)^{-1}X'e_{*j}], \tag{6.13}$$

which is used to approximate the following quantity, which is similar to the
left hand side of (6.7), with $b^k$ denoting the ridge estimator:

$$b^k - \beta = [(X'X + kI)^{-1}X'X - I]\,\beta + [(X'X + kI)^{-1}X'u]. \tag{6.14}$$

The corresponding right hand side in (6.7) is simply $Wu$, which does
not depend on the unknown parameter $\beta$. In the case of OLS, when $k = 0$,
the first term on the right hand side of (6.14) vanishes, and the right hand
side of (6.14) becomes $Wu$. Unfortunately, in the case of the ridge based $b^k$,
equation (6.14) does depend on $\beta$ and is obviously not a pivotal quantity. If
we use (6.13) to approximate (6.14), it may be incorrect to the extent that
$b^0$ does not equal $\beta$. Thus the criticism by Schenker (1985) applies, and we
note that the ridge based bootstrap confidence intervals cover the regression
parameters with probability $1 - 2\alpha$ only approximately. The trouble is that
the coverage probability itself is affected by the value of the unknown
parameters. It is tempting to use a simulation to assess the importance of this
problem, except that the conclusions from the simulation will depend on the
values of parameters assumed.
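To make the dependence on the unknown parameter concrete, the two bracketed terms of (6.13) and (6.14) can be computed separately. This is a sketch with hypothetical names, not the author's code; `coef` stands for either $\beta$ or $b^0$ and `noise` for either $u$ or a resampled residual vector.

```python
import numpy as np

def ridge_decomposition(X, k, coef, noise):
    """Return ([(X'X + kI)^{-1} X'X - I] coef, (X'X + kI)^{-1} X' noise)."""
    p = X.shape[1]
    XtX = X.T @ X
    ridge_mat = XtX + k * np.eye(p)
    A = np.linalg.solve(ridge_mat, XtX)               # (X'X + kI)^{-1} X'X
    shrink_term = (A - np.eye(p)) @ coef              # depends on coef; zero when k = 0
    noise_term = np.linalg.solve(ridge_mat, X.T @ noise)
    return shrink_term, noise_term
```

With `k = 0` the first term is zero whatever `coef` is, which is the OLS case; for `k > 0` it changes with `coef`, which is why the ridge quantity is not pivotal.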

Table 5 reports the 95% bootstrap confidence intervals for the cement
data, where the column marked 1 refers to the first regression coefficient as
in Table 2, and is explained as follows. Ridge estimation of the 500 pseudo $y$
data sets yields 500 regression coefficients. These are first ranked in ascending
order of magnitude as an approximation to the sampling distribution. Since
2.5% of 500 is 12.5, the simple average of the 12th and 13th ranked regression
coefficients yields 0.3730 as the lower limit. The upper limit 0.6365 is a
similar average of the 487th and 488th ranked values. The reflection of this
interval, using the ridge estimate 0.5038 from the 8th row of Table 2 instead
of $b^0$ in (6.10), yields 0.3711 and 0.6346. Now the 247th ranked value
is 0.5033, which is closer to the ridge value 0.5038 than any other value. Hence
we say that the ridge estimate is slightly median biased, because the median
is the average of the 250th and 251st ranked values. Now $z_0$ of (6.11) is -0.015,
using the tables for the cumulative normal distribution. Instead of -1.96 and
1.96 we have $2z_0 - 1.96 = -1.99$ and $2z_0 + 1.96 = 1.93$ as the numbers used to
find the corresponding
values of the normal CDF to be 0.0233 and 0.9732, yielding the 12th and 486th
ranked values as the “bias corrected” percentiles reported to be 0.3724 and
0.6310 in Table 5. Other columns of Table 5 are similarly computed. Table
5 also reports approximate Bayes confidence intervals using equation (3.9),







Table 5. Cement Data Confidence Intervals
for Approximate Bayes and Bootstrap

                                                    i
                                      1         2         3         4

Approximate Bayes Interval:
  Lower Limit:                     0.2805   -0.1500   -0.2986   -0.8696
  Upper Limit:                     0.7273    0.7639    0.1704    0.0876

Bootstrap Interval from Quantiles:
  Lower Limit:                     0.3730    0.1529   -0.2096   -0.5463
  Upper Limit:                     0.6365    0.4423    0.0642   -0.2539

Bootstrap Reflected Interval:
  Lower Limit:                     0.3711    0.1715   -0.1926   -0.5283
  Upper Limit:                     0.6346    0.4609    0.0812   -0.2359

Bootstrap Bias Corrected Interval:
  Lower Limit:                     0.3724    0.1471   -0.2075   -0.5463
  Upper Limit:                     0.6310    0.4333    0.0685   -0.2539

Notes: All are 95% intervals comparable to those of Table 2. Equation
(3.9) defines the approximate Bayes method. Quantiles are based on the
ranking of ridge estimators from pseudo y data for J = 500 replications.
Reflected and bias corrected intervals are based on equations (6.10) and
(6.12) respectively, modified by using the ridge estimator instead of the OLS
estimator.

which are found to be closer to the UMSE intervals of Table 2 than the
bootstrap intervals, for this choice of $k$. In unfortunate cases where ridge
regression may not be appropriate, both approximate Bayes and bootstrap 
intervals may be too narrow and too optimistic about the correctness of 
ridge estimates. 

Even though one may not be satisfied with the approximate coverage
probabilities, the plots of the bootstrap distributions can be useful to the
economist in other ways, and may be closely studied and interpreted. They
can indicate how sensitive the reported confidence interval is to the arbitrary
choice of the significance level, to the standard errors based on the normality
assumption, etc. These intervals are narrower than those based on Stein's
UMSE, and one may wonder about their reliability, since they require the
strong assumption that $b^0$ is a good approximation to $\beta$.

7. CONCLUSION 

One should not expect the confidence interval problem to have a unique 
solution. This paper has studied four alternative methods of construct- 
ing confidence intervals for the regression parameter estimated by ordinary 
ridge regression. The OLS interval is the simplest, and if it is economically 
meaningful there is no need to consider any other. Often, the OLS interval 
is centered at the coefficient with the “wrong sign”, so that much of the 
interval lies in a range which is economically meaningless. In these applica- 
tions the practical motivation behind using ridge regression is that it yields 
economically meaningful estimates, such as positive marginal propensity to 
consume (MPC). If the economist wishes to use a confidence interval cen-
tered at the ridge estimate the choice is between three methods: approximate 
Bayes based on (3.9), UMSE from (3.7) and (3.8), and finally the bootstrap. 
The approximate Bayes may not provide sufficient penalty for using a ridge 
biasing parameter $k$ that is “too large”, since it uses square roots of the diag-
onal elements of the inverse of the $(X'X + kI)$ matrix in the standard error
formulas, which means that the standard error may decrease monotonically 
with increasing k. Similarly, bootstrap intervals may not provide a sufficient 
warning against a bad choice of k. 

The UMSE intervals proposed here are new and conservative in the sense 
that they are narrower than the OLS only when the reduction in variance 
achieved by ridge regression is sufficiently large, without using “too small” 
shrinkage factors. The simulation study reported here indicates that some 
of the approximations used here do not jeopardize the validity of the basic 
confidence statement. Its computational burden is lighter than that of the 
bootstrap. The reporting of UMSE intervals can discourage potential mis- 
uses of ridge regression. The modification of UMSE for the case of stochastic
k is discussed in an Appendix. In the author’s own experience, the presence 
of a stochastic k does not make a practical difference, but the UMSE intervals 
for large p are found to be too conservative. 

APPENDIX: EXTENSION FOR STOCHASTIC SHRINKAGE FACTORS 

For simplicity, the discussion in Section 2 assumes that the shrinkage
factor $\delta$ is nonstochastic. Now we consider the general case where $\delta$ is a
function of the random variable $x$. We use Lemma 4 of Stein (1981) to
write:

$$E[(x + f - \xi)^2] = 1 + Ef^2 + 2Ef', \tag{A.1}$$

where $f$ is $f(x)$, and $f'$ is the derivative with respect to $x$. Note that if one
chooses $f = (\delta - 1)x$ one obtains (2.1) as a special case of (A.1). Now the
left hand side of (2.3) is a special case of:

$$\begin{aligned}
E[(x + f - \xi)^2 &- (1 + f^2 + 2f')]^2 \\
&= E[(x - \xi)^2 + 2(x - \xi)f - 1 - 2f']^2 \\
&= 2 + 4E[f^2 + 2f' + f'^2],
\end{aligned} \tag{A.2}$$

where the expressions involving $(x - \xi)^j g(x)$ are replaced by derivatives of
$g(x)$ for $j = 1, 2, 3$ by the Lemma mentioned above. The expression (A.2) is
remarkably simple, involving only the first derivative, thanks to a cancellation
of all higher order terms.
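As a consistency check on (A.2), consider the nonstochastic case $f = cx$ with $c = \delta - 1$, assuming $x \sim N(\xi, 1)$ as in Stein's setting and writing $z = x - \xi$. The working below is an illustrative verification, not part of the original derivation; both sides agree.

```latex
\begin{aligned}
(x-\xi)^2 + 2(x-\xi)f - 1 - 2f' &= (1+2c)(z^2 - 1) + 2c\,\xi z, \qquad z \sim N(0,1), \\
E\big[(1+2c)(z^2 - 1) + 2c\,\xi z\big]^2 &= 2(1+2c)^2 + 4c^2\xi^2
  \quad \text{since } \operatorname{var}(z^2) = 2,\ \operatorname{var}(z) = 1,\ E z^3 = 0, \\
2 + 4E[f^2 + 2f' + f'^2] &= 2 + 4\big[c^2(1+\xi^2) + 2c + c^2\big] = 2(1+2c)^2 + 4c^2\xi^2 .
\end{aligned}
```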

If $\delta$ is a function of $x$, the $f'$ in (A.2) will be $(\delta - 1 + x\delta')$ instead of $(\delta - 1)$,
which leads to a modified expression for the SD of (2.8), especially changing
the term after the last comma. When $\delta_i$ is stochastic in the regression
application, one can use the analytical expression for its derivative if it
is available, or a geometrical estimate from the ridge trace, or one can consider a
“small” change in $k$ and approximate $\delta_i'$ by the ratio of the change in $\delta_i$ to
the change in $c_i$. Although the S* defined after (3.3) is modified by using
(A.2) to allow for stochastic $\delta_i$, there seems to be no conceptual difficulty in
deriving confidence intervals in the stochastic shrinkage case.


ASYMPTOTIC PROPERTIES OF SINGLE EQUATION
ERRORS IN VARIABLES ESTIMATORS IN
RATIONAL EXPECTATIONS MODELS

1. INTRODUCTION 

The purpose of this paper is essentially twofold, first to introduce some 
new single equation errors in variables estimators for simultaneous equations 
models containing rational expectations variables, and second to derive their 
asymptotic properties. In addition, a consistency proof for a new estimator 
due to Fair (1984a) is presented. 

Single equation errors in variables estimators have been used extensively 
in the applied rational expectations literature since their introduction by 
McCallum (1976). This popularity has been due largely to their simplicity 
and ease of implementation as compared to the more complex and expensive 
systems estimation methods explored by Hansen and Sargent (1980, 1982), 
Wallis (1980), and Fair and Taylor (1983) among others — avoiding, in par- 
ticular, the necessity of solving the system. Additional advantages of the 
single equation errors in variables estimators include (i) a certain degree of 
robustness to misspecification elsewhere in the system, (ii) the capability of 
estimating a single equation from a very large system too complex for the full 
information methods (Fair, 1984a), and (iii) the fact that they continue to 
be consistent even when expectations are not in fact rational as long as the 
instruments used are in the information set used to form the expectations 
(Fair, 1984b). 

The organization of the paper is as follows: Section 2 describes the types 
of situation which call for the estimation techniques herein considered. Sec- 
tion 3 outlines the existing estimators which have been proposed. Section 4 
introduces the various new estimators, and Section 5 presents the asymptotic
results.

2. THE ESTIMATION PROBLEM

Consider the following equation from a simultaneous equations model:

$$y_t = [\,{}_{t-1}z_t^* \;\; Z_t\,]\,\delta + u_t, \tag{1}$$

where $y_t$ is a scalar endogenous variable, $u_t$ is a disturbance term, $Z_t$ is a
row vector containing $k$ endogenous and $h - k - 1$ predetermined variables,
$\delta$ is an $h \times 1$ vector of coefficients, ${}_{t-1}z_t^*$ is the rational expectation of an
endogenous variable not included in $Z_t$ and not on the left hand side of the
equation, and $[\,{}_{t-1}z_t^* \;\; Z_t\,]$ is a $1 \times h$ row vector. We consider various cases.

(a) $u_t$ serially uncorrelated

In this case we assume that $E(u_t \mid I_{t-1}) = 0$, where $I_{t-1}$ is the information
set available at time $t - 1$, which implies that the $u_t$ are not serially
correlated. The errors in variables “trick” is to replace the unobservable
${}_{t-1}z_t^*$ with the observable $z_t$, yielding

$$y_t = [\,z_t \;\; Z_t\,]\,\delta + u_t - \delta_1\eta_t = Q_t\delta + v_t, \tag{2}$$

where $\delta_1$ is the first element of $\delta$, $\eta_t$ is the forecast error $z_t - {}_{t-1}z_t^*$, which
by the rational expectations hypothesis satisfies $E(\eta_t \mid I_{t-1}) = 0$, and $v_t$
is the composite disturbance. For estimation we now need to find some
suitable instruments. Notice that using $z_t$ as a proxy results in the composite
disturbance $v_t$ being potentially correlated with all the elements of $Q_t$
not included in $I_{t-1}$, except those predetermined variables which are perfectly
predictable at time $t - 1$, e.g., time trends and constants. Instruments
are, however, available in the form of any lagged endogenous or exogenous
variables in the model, as these are all included in $I_{t-1}$, and of course the
perfectly predictable time $t$ variables.

We may then obtain consistent estimates of $\delta$. These estimates are
termed 2SLS in the literature and we maintain this usage, although, strictly
speaking, they should be referred to as instrumental variables estimates. Moreover,
the estimated standard errors will be consistent estimates of the true standard
errors, this following from the fact that the composite disturbances
are serially uncorrelated, given one further assumption, namely conditional
homoscedasticity, i.e., $E(v_t^2 \mid F_t) = \sigma^2$ for all $t$, where $F_t$ denotes the instrument
set.

(b) $u_t$ serially correlated

Consider the following example:

$$y_t = [\,{}_{t-1}z_t^* \;\; Z_t\,]\,\delta + u_t; \quad u_t = \rho u_{t-1} + w_t; \quad E(w_t \mid I_{t-1}) = 0; \quad |\rho| < 1. \tag{3}$$

Substituting for the rational expectation we obtain:

$$y_t = Q_t\delta + u_t - \delta_1\eta_t. \tag{4}$$

It is convenient to remove the AR part of the composite disturbance by
quasi-differencing, which yields:

$$y_t = \rho y_{t-1} + (Q_t - \rho Q_{t-1})\delta + v_t, \tag{5}$$

where $v_t = w_t - \delta_1\eta_t + \rho\delta_1\eta_{t-1}$.

It may now be ascertained that the $v_t$ are serially correlated, although only
of first order, and therefore, by a theorem due to Anderson (1971), may be
expressed as an MA(1) disturbance.

Instruments for this AR(1) example need to be lagged two periods instead
of the one period as before, this following from the fact that whereas
$E(v_t \mid I_{t-1}) \ne 0$, $E(v_t \mid I_{t-2}) = 0$. Having selected appropriate instruments
we may then estimate as before, except that having quasi-differenced we need
to use nonlinear two stage least squares (NL2SLS) in place of 2SLS. Unfortunately,
however, although the parameter estimates will be consistent (but
not asymptotically efficient), the estimated standard errors will not even be
consistent.

A similar problem arises whenever any of the following circumstances
obtains, either individually or severally: (i) serial correlation in the
structural disturbance, (ii) expectations formed in any period later than $t - 1$,
and (iii) expectations horizons exceeding the sampling interval (Cumby et
al., 1983; Brown and Maital, 1981; Hakkio, 1981).

Two problems therefore arise: (i) how to improve the efficiency of the 
parameter estimates and (ii) how to obtain consistent estimates of the stan- 
dard errors. 

To solve these problems, we outline some of the existing estimators which 
have been designed to cope with the problem. Before this, however, it will 
be convenient to establish a framework provided by Cumby et al. (1983) 
into which these models fall. 

Essentially, our aim is to estimate parameters of models of the general
form:

$$Y = Qg(\delta) + v, \tag{6}$$

where $g$ is a one-to-one function taking elements of the parameter space into
a space of equal or greater dimension, subject to the existence of an integer
$N$ and a row vector, $F_t$, of instruments satisfying:

$$E(v_t \mid v_{t-N}, v_{t-N-1}, \ldots, F_t, F_{t-1}, \ldots) = 0. \tag{7}$$

Clearly, our model falls into this framework, as it allows for serial correlation
and it assumes that the instruments are predetermined rather than strictly
exogenous (Hayashi and Sims, 1983). We shall assume henceforward that a
correct set of instruments is being employed at all times.

3. EXISTING ESTIMATORS 

We now consider some of the existing estimators which have been put 
forward to deal with this type of estimation problem. 

(i) McCallum (1979) suggested using both NL2SLS to estimate the parameters
of (6) and a correction, provided by Hansen (1982), for the inconsistent
estimates of the standard errors. Cumby et al. (1983) have shown that
this estimator falls into the class of single equation analogues of Generalised
Method of Moments (GMM) estimators (Hansen, 1982) exploiting the orthogonality
conditions $E(F_t'v_t) = 0$, into which Cumby et al.'s two step, two
stage least squares (2S2SLS) estimator (discussed next) also falls; they also
have shown that within this class 2S2SLS is asymptotically efficient.

(ii) Cumby et al. (1983) have recently put forward the 2S2SLS estimator.
This estimator may be developed in the following way. Transform model (6)
by the transposed instrument matrix $F'$ to yield:

$$F'Y = F'Qg(\delta) + F'v. \tag{8}$$

By the orthogonality conditions, $E(F_t'v_t) = 0$, nonlinear least squares will
give a consistent estimate of $\delta$. But, as a result of $E(F'vv'F)$ not being proportional
to the identity matrix, we can produce a more efficient estimator.
Let $\Omega_T = (1/T)E(F'vv'F)$ and assume that $\Omega_T$ is a positive definite matrix.
Then $\Omega_T$ may be expressed as $RR'$, where $R$ is non-singular, with the result
that the transformed model:

$$R^{-1}F'Y = R^{-1}F'Qg(\delta) + R^{-1}F'v \tag{9}$$

has a disturbance covariance matrix proportional to the identity matrix. The 2S2SLS
estimator may now be defined. It is the vector $\hat\delta_{2S2SLS}$ that minimises the
criterion function:

$$\Phi(\delta) = [Y - Qg(\delta)]'F\,\hat\Omega^{-1}F'[Y - Qg(\delta)], \tag{10}$$

where $\hat\Omega$ is a consistent estimator of $\Omega = \lim (1/T)E(F'vv'F)$. Various different
estimators of $\Omega$ have been put forward by Cumby et al. (1983), Hansen
(1982), and Newey and West (1985).
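In the linear special case $g(\delta) = \delta$, the minimiser of the criterion (10) has a closed form, which the following sketch computes. It is an illustration only: the function names are hypothetical, and the weighting matrix `Omega_hat` must be supplied by the caller, since no particular estimator of $\Omega$ is implemented here.

```python
import numpy as np

def linear_gmm(Y, Q, F, Omega_hat):
    """Minimise (Y - Q d)' F Omega_hat^{-1} F' (Y - Q d) over d."""
    FQ = F.T @ Q                                   # F'Q
    FY = F.T @ Y                                   # F'Y
    A = FQ.T @ np.linalg.solve(Omega_hat, FQ)      # Q'F Omega^{-1} F'Q
    b = FQ.T @ np.linalg.solve(Omega_hat, FY)      # Q'F Omega^{-1} F'Y
    return np.linalg.solve(A, b)

def two_sls(Y, Q, F):
    """Ordinary 2SLS / instrumental variables with instrument matrix F."""
    PQ = F @ np.linalg.solve(F.T @ F, F.T @ Q)     # projection of Q on the instruments
    return np.linalg.solve(PQ.T @ Q, PQ.T @ Y)
```

When `Omega_hat` is proportional to `F.T @ F`, `linear_gmm` reproduces `two_sls`; the two differ once the weighting matrix reflects serial correlation in the disturbance.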
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(iii) Hayashi and Sims (1983) proposed an estimator which exploits a dif- 
ferent set of orthogonality conditions, namely $E(F_{t-j}'v_t) = 0$, $j \ge 0$. In
addition, they assumed that the serial correlation in $v$ does not depend on
$F$, that is

$$E(v_t v_s \mid F_t, F_{t-1}, \ldots) = E(v_t v_s) \quad \text{for } t \ge s. \tag{11}$$

This last assumption is necessary to evaluate the asymptotic variance-covariance
matrix for this estimator. This evaluation is carried out in Section
5. Hayashi and Sims proceeded by transforming model (6) by the upper triangular
inverse of $\Gamma$, where $E(vv') = \sigma^2 V$ and $V = \Gamma\Gamma'$ with $\Gamma$ non-singular:

$$\Gamma^{-1}Y = \Gamma^{-1}Qg(\delta) + \Gamma^{-1}v. \tag{12}$$

By the orthogonality conditions stated above, $E(F_t'(\Gamma^{-1}v)_t) = 0$, and therefore
we may use $F_t$ as instrumental variables to obtain a consistent estimate
of $\delta$. This involves the minimization of the criterion function:

$$\Phi(\delta) = [Y - Qg(\delta)]'\,\Gamma^{-1\prime}F(F'F)^{-1}F'\Gamma^{-1}[Y - Qg(\delta)]. \tag{13}$$

Intuitively what is going on here is that we are “forward filtering” the equa- 
tion (6) in such a way that we eliminate the serial correlation in the distur- 
bance term. But, not only are we eliminating the serial correlation, we are 
also transforming the equation in such a way that the transformed distur- 
bance terms are orthogonal to the instrument matrix F, which may then be 
used to provide consistent estimates. 

To implement this technique, Hayashi and Sims (1983) proposed estimation
of the matrix $\Gamma^{-1}$ by one of two methods. To outline these methods, it
will be convenient to make a slight digression.

We assume for simplicity that the composite disturbance $v_t$ may be
expressed as an MA(1) process, i.e., it may be written as either:

$$v_t = \varepsilon_t - \theta\varepsilon_{t-1} = (1 - \theta L)\varepsilon_t \tag{14}$$

or

$$v_t = \xi_t - \theta\xi_{t+1} = (1 - \theta L^{-1})\xi_t, \tag{15}$$

the former being the standard “backward representation” and the
latter being the “forward representation”. To eliminate the moving average
nature of the composite disturbance we need to filter $v_t$. In the case of the
“backward representation” the appropriate filter is $(1 - \theta L)^{-1}$, while in the
case of the “forward representation” the appropriate filter is $(1 - \theta L^{-1})^{-1}$.

We now concentrate exclusively on the “forward representation”, this being
the relevant one for our purpose. An alternative way of representing a
moving average process of finite order is as an infinite order autoregressive
process. In our case this means that

$$(1 - \theta L^{-1})^{-1}v_t = (1 + \theta L^{-1} + \theta^2 L^{-2} + \ldots)v_t = \xi_t. \tag{16}$$



We may now outline the two methods of implementation proposed by 
Hayashi and Sims (1983). 

The first, denoted by $HS_1$ hereafter, consists of the following: estimating
the moving average coefficient, $\theta$, from the residuals of a first stage estimate
of (6) by some consistent technique, e.g., NL2SLS; truncating the autoregressive
representation of the forward filter (16) at the $s$th lag, where $s = \sqrt{T}$;
and transforming model (6) by the estimate of the matrix below:

$$\begin{bmatrix}
1 & \theta & \cdots & \theta^s & 0 & \cdots & 0 \\
0 & 1 & \theta & \cdots & \theta^s & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & & \ddots & 0 \\
0 & \cdots & 0 & 1 & \theta & \cdots & \theta^s
\end{bmatrix}. \tag{17}$$

The transformed model (6) is then estimated by instrumental variables using
untransformed instruments.

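The second method, denoted by $HS_2$ hereafter, consists of the following:
estimating an unstructured autoregressive representation of order $s$, i.e.,
estimating the parameters of:

$$\hat v_t = \alpha_1\hat v_{t-1} + \alpha_2\hat v_{t-2} + \ldots + \alpha_s\hat v_{t-s} + \varepsilon_t, \tag{18}$$

where the $\hat v_t$ are the residuals from consistent estimation of (6); transforming
(6) by the following estimate of the matrix:

$$\begin{bmatrix}
1 & -\hat\alpha_1 & -\hat\alpha_2 & \cdots & -\hat\alpha_s & 0 & \cdots & 0 \\
0 & 1 & -\hat\alpha_1 & -\hat\alpha_2 & \cdots & -\hat\alpha_s & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & \ddots & & \ddots & 0 \\
0 & \cdots & 0 & 1 & -\hat\alpha_1 & -\hat\alpha_2 & \cdots & -\hat\alpha_s
\end{bmatrix}; \tag{19}$$

and then again estimating the transformed model (6) by instrumental variables
using untransformed instruments.

It will be noted that both of the estimated $\Gamma^{-1}$ matrices are of order
$(T - s) \times T$; hence $s$ observations are lost.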
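Both transformation matrices are banded $(T - s) \times T$ arrays with one row per retained observation, so a single helper can build them. The sketch below assumes the coefficients have already been estimated; the function names are hypothetical and the estimation step itself is not shown.

```python
import numpy as np

def forward_filter_matrix(coeffs, T):
    """(T - s) x T matrix whose row t holds 1, c_1, ..., c_s starting at column t."""
    coeffs = np.asarray(coeffs, dtype=float)
    s = len(coeffs)
    row = np.concatenate(([1.0], coeffs))
    G = np.zeros((T - s, T))
    for t in range(T - s):
        G[t, t:t + s + 1] = row
    return G

def hs1_matrix(theta, T, s):
    """Truncated forward filter with coefficients theta, theta^2, ..., theta^s."""
    return forward_filter_matrix(theta ** np.arange(1, s + 1), T)

def hs2_matrix(alpha_hat, T):
    """Unstructured version with coefficients -alpha_1, ..., -alpha_s."""
    return forward_filter_matrix(-np.asarray(alpha_hat, dtype=float), T)
```

For example, `hs1_matrix(0.5, T=6, s=2)` has four rows, the first being `[1, 0.5, 0.25, 0, 0, 0]`; the two missing rows are the $s$ observations that are lost.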

As we can see from the above, the estimators proposed by Cumby et al.
(1983) and Hayashi and Sims (1983) exploit different sets of orthogonality
conditions, and as a result, for a fixed instrument list, it is impossible to say
which of them will be asymptotically efficient. We say “for a fixed instrument
list” because Hayashi and Sims have shown that when (7) holds, as
the sample size and the instrument list increase, the asymptotic variance-covariance
matrices of the two estimators approach the same limiting matrix.
Additional instruments come from lagging $F_t$. Moreover, they have
demonstrated that this limiting matrix is the asymptotic variance-covariance
matrix of Hansen and Sargent's (1982) optimal instrument set instrumental
variables estimator. Unfortunately, one cannot determine this optimal
instrument set without having details of the entire structural model.

We should also mention that standard estimation methods to cope with 
serial correlation in this context are not consistent, this being pointed out 
first by Flood and Garber (1980). 



4. NEW ESTIMATORS 



We now introduce our new estimators, which essentially represent potential
improvements on the Hayashi and Sims estimators, $HS_1$ and $HS_2$,
outlined above, at least in small samples. The rationale for these new estimators
comes from two sources, theoretical and empirical, and attention
will be devoted to both.

It will be recalled that both $HS_1$ and $HS_2$ drop $s$ observations from the
sample. Clearly, in a small sample this procedure is wasteful and it would
be preferable to retain the dropped observations. With respect to $HS_1$ the
immediate idea which comes to mind is to fill in more of the estimate of the
matrix, i.e., to transform (6) by the following matrix:

$$\begin{bmatrix}
1 & \theta & \theta^2 & \cdots & \theta^{T-1} \\
0 & 1 & \theta & \cdots & \theta^{T-2} \\
\vdots & & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & 1 & \theta \\
0 & \cdots & \cdots & 0 & 1
\end{bmatrix}. \tag{20}$$

This transformation is equivalent to transforming (6) by a truncated forward
filter comprising only the sample elements of $Y$ and $Q$. We denote this
estimator by $HS_3$ hereafter. An equivalent representation of the transformed
(6) is:

$$Y_t^* = Q_t^* g(\delta) + v_t^*, \tag{21}$$

where $Y_t^* = \sum_i \theta^i Y_{t+i}$, $Y_{T+1}^* = 0$; $Q_t^* = \sum_i \theta^i Q_{t+i}$, $Q_{T+1}^* = 0$;
$v_t^* = \sum_i \theta^i v_{t+i}$, $v_{T+1}^* = 0$; and the summations are over the range $i = 0$ to $i = T - t$.

A backward form of this transformation has been used in the literature
before by, inter alia, Harvey and McAvinchey (1981), Pollock (1979), and
Phillips (1966). However, a variation on this transformation, which has been
proposed in the backward case for the regression model with MA(1) errors by
MacDonald and McKinnon (1985), and extended recently to the regression
model with MA($q$) errors by Ullah et al. (1985), and extended still more
recently to the simultaneous equations case by Ullah and Power (1985), is
available and constitutes our second new estimator, which we denote by $HS_4$
hereafter. In the MA(1) case this estimator may be represented as:

$$Y_t^* = Q_t^* g(\delta) + G_t^*\gamma + \xi_t, \tag{22}$$

where $Y_t^*$ and $Q_t^*$ are defined as above; $\gamma = -\xi_{T+1}$; $G_t^* = \theta G_{t+1}^*$; and
$G_{T+1}^* = 1$. Essentially, the conversion of $v_t^*$ to white noise represents the
additional twist in $HS_4$ as opposed to $HS_3$.
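The relation between the two new transformations can be checked numerically. The sketch below assumes a forward MA(1) disturbance $v_t = \xi_t - \theta\xi_{t+1}$ and uses hypothetical helper names: `forward_cumulate` applies the truncated forward filter through the recursion $x_t^* = x_t + \theta x_{t+1}^*$ with $x_{T+1}^* = 0$, and `hs4_regressor` returns the extra regressor $G_t^* = \theta^{T-t+1}$.

```python
import numpy as np

def forward_cumulate(x, theta):
    """x*_t = sum_{i=0}^{T-t} theta^i x_{t+i}, computed backwards from t = T."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    carry = 0.0                                  # plays the role of x*_{T+1} = 0
    for t in range(len(x) - 1, -1, -1):
        carry = x[t] + theta * carry
        out[t] = carry
    return out

def hs4_regressor(theta, T):
    """G*_t = theta^(T - t + 1) for t = 1, ..., T."""
    return theta ** np.arange(T, 0, -1)
```

Under the stated assumption the filtered disturbance satisfies $v_t^* = \xi_t - \theta^{T-t+1}\xi_{T+1}$, so adding `hs4_regressor(theta, T)` as a regressor with coefficient $-\xi_{T+1}$ leaves a white noise remainder.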

Both $HS_3$ and $HS_4$ retain all $T$ observations, and are variations on $HS_1$;
this would seem on theoretical grounds to be a potential improvement. Some
empirical evidence is also available which points in this direction. This evidence
comes from Monte Carlo work on the regression model with MA(1) errors
and backward transformations which has been done by Balestra (1980),
Park and Heikes (1983), and Harvey and McAvinchey (1981). In general, an
estimator analogous to $HS_4$ is found to dominate an estimator which is analogous
to $HS_3$, which in turn is found to dominate an estimator analogous
to $HS_1$. However, an estimator analogous to the “exact transformation” estimator,
to be discussed next, is found to dominate all of the above. Before
turning to this “exact transformation” estimator we mention a potential improvement
on $HS_2$. This estimator, denoted by $HS_5$, is formed as $HS_2$ but
with the $s$ dropped observations being made up with terms analogous to the
additional terms that the Prais-Winsten estimator has over the Cochrane-Orcutt
estimator in the conventional regression model with AR($p$) errors.

We now consider the “exact transformation” estimator. This estimator
takes as its point of departure the variance-covariance matrix of the composite
disturbance $v_t$. For convenience we shall assume here, as above, that
$v_t$ is MA(1). In this case the variance-covariance matrix is $\sigma^2 V$, where $V$
is the well known tri-diagonal matrix with $1 + \theta^2$ on the leading diagonal
and $-\theta$ on the off diagonals. The “exact transformation” estimator proposes
transformation of (6) by an upper-triangular matrix $H$ constructed such that
$H'H = V^{-1}$. In the MA(1) case we have developed this matrix $H$ explicitly.
It may be expressed as $\Lambda^{-1/2}M$, where

$$M = \begin{bmatrix}
\theta^0 a_{T-1}/a_T & \theta a_{T-2}/a_T & \cdots & \theta^{T-1}a_0/a_T \\
0 & \theta^0 a_{T-2}/a_{T-1} & \cdots & \theta^{T-2}a_0/a_{T-1} \\
\vdots & & \ddots & \vdots \\
0 & \cdots & 0 & \theta^0 a_0/a_1
\end{bmatrix}, \tag{23}$$

$\Lambda = \mathrm{diag}(a_{T-1}/a_T,\ a_{T-2}/a_{T-1},\ \ldots,\ a_0/a_1)$, and $a_j = 1 + \theta^2 + \ldots + \theta^{2j}$.
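One closed form that satisfies $H'H = V^{-1}$ for the tri-diagonal MA(1) matrix is $H_{ij} = \theta^{j-i} a_{T-j} / \sqrt{a_{T-i}\,a_{T-i+1}}$ for $j \ge i$, with $a_j = 1 + \theta^2 + \ldots + \theta^{2j}$. The sketch below builds this matrix so that the identity can be checked numerically for any particular $\theta$ and $T$; the function names are hypothetical, and the innovation variance is normalised to one.

```python
import numpy as np

def exact_ma1_transform(theta, T):
    """Upper-triangular H with H'H = V^{-1} for the MA(1) matrix V."""
    a = np.cumsum(theta ** (2 * np.arange(T + 1)))     # a_0, a_1, ..., a_T
    H = np.zeros((T, T))
    for i in range(1, T + 1):                          # 1-based row index
        for j in range(i, T + 1):                      # 1-based column index, j >= i
            H[i - 1, j - 1] = theta ** (j - i) * a[T - j] / np.sqrt(a[T - i] * a[T - i + 1])
    return H

def ma1_covariance(theta, T):
    """Tri-diagonal V: 1 + theta^2 on the diagonal, -theta on the off diagonals."""
    return (1 + theta ** 2) * np.eye(T) - theta * (np.eye(T, k=1) + np.eye(T, k=-1))
```

A quick check is that `H @ V @ H.T` is numerically the identity, which says that the transformed disturbances are uncorrelated with unit variance.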

In the case of higher order MA disturbances, the development of the
explicit form of the $H$ matrix appears tedious and indeed has not even been
developed for the general case, i.e., the case of $H$ not being restricted to be
an upper-triangular matrix. In this case, we propose numerical inversion and
decomposition of $V$. In practice, of course, to implement this exact estimator
we need to estimate $\theta$, as indeed we do to implement $HS_1$, $HS_3$, and $HS_4$.
For this, we propose one of the following four methods: (i) estimate (6)
by some consistent procedure such as NL2SLS and then use Box-Jenkins
methods on the residuals, (ii) use a grid search over the range of possible
values of $\theta$, (iii) use a new method initially put forward by Ullah et al. (1985)
for the regression case and recently extended to the simultaneous equations
case by Ullah and Power (1985), and (iv) estimate (6) by some consistent
procedure such as NL2SLS and then use a method of moments estimator
due to Fuller (1976).

An alternative estimation method which is also based on the variance-covariance
matrix has been used by Fair (1984a). This method estimates
the matrix $\sigma^2 V$ by a tri-diagonal matrix in which the elements
on the leading diagonal are $(1/T)\sum_t \hat v_t^2$, where the $\hat v_t$'s are the residuals
from consistent estimation of (6), and the off-diagonal elements are
$(1/(T - 1))\sum_t \hat v_t\hat v_{t-1}$. Having constructed this matrix estimate one then
inverts and decomposes it as above to provide a suitable upper-triangular
transformation matrix for (6).
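The numerical inversion and decomposition step can be sketched as follows, assuming the residuals are available as an array. The function name is hypothetical, the sample moments used are the natural ones for the diagonal and first off-diagonal, and `np.linalg.cholesky` raises an error if the estimated matrix is not positive definite.

```python
import numpy as np

def fair_transform(resid):
    """Return (H, S): S is the estimated tri-diagonal matrix, H'H = S^{-1}."""
    v = np.asarray(resid, dtype=float)
    T = len(v)
    c0 = np.sum(v ** 2) / T                        # leading-diagonal element
    c1 = np.sum(v[1:] * v[:-1]) / (T - 1)          # first off-diagonal element
    S = c0 * np.eye(T) + c1 * (np.eye(T, k=1) + np.eye(T, k=-1))
    L = np.linalg.cholesky(np.linalg.inv(S))       # lower-triangular, L L' = S^{-1}
    return L.T, S                                  # H = L' is upper-triangular
```

Transposing the lower-triangular Cholesky factor of the inverse gives the upper-triangular matrix required; the same step would serve for higher order MA disturbances once the corresponding banded matrix has been estimated.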

All of the above single equation errors in variables estimators may be
extended easily to the general case of an ARMA($p$, $q$) structural disturbance
$u_t$ and a general ${}_{t-j}z_t^*$ or ${}_{t}z_{t+j}^*$ rational expectation, or indeed to the
case where more than one rational expectation appears in the equation.
The procedure is to replace all the rational expectations terms with their
realized values, to quasi-difference the resulting equation until the composite
disturbance is free of any AR component, to establish the order of the MA
process describing the new composite disturbance, and to proceed as above.

5. ASYMPTOTIC PROPERTIES

In this section we derive the asymptotic properties of the various new 
single equation errors in variables estimators that we have outlined in Sec- 
tion 4. We begin by showing the asymptotic properties of the Hayashi and 
Sims (1983) estimator and then prove that our various new estimators are 
asymptotically equivalent. Finally, we present a consistency proof for the 
Fair (1984a) estimator. 

Following Hayashi and Sims (1983), we restrict ourselves to the linear
case; that is, we shall be considering the estimation of:

    $Y_t = Q_t\delta + v_t$    (24)

under the identifying assumption

    $E[v_t \mid v_{t-N}, v_{t-N-1}, \ldots, F_t, F_{t-1}, \ldots] = 0$    (25)

and exclude the case of unit roots in the composite disturbance. In addition,
we make the following two general assumptions: (i) there exist $H > h$
instrumental variables, $F_t$, and a positive integer $N$ satisfying the identifying
assumption (25), and (ii) $Y_t$, $Q_t$, and $F_t$ are jointly covariance-stationary
and ergodic processes.

We begin by considering the Hayashi and Sims (1983) estimator. Recall
that above we have used the notation $E(vv') = \sigma^2 V$, where $V = \Gamma\Gamma'$, and
that the transformation matrix was $\Gamma^{-1}$.

Proposition 1. Let $Q^+ = \Gamma^{-1}Q$, $Y^+ = \Gamma^{-1}Y$, and $v^+ = \Gamma^{-1}v$. Assume:

(a) $E[F_t'F_t]$ is positive definite.

(b) $E[Q_t^{+\prime}F_t]$ is of full row rank.

(c) $\lim (1/T)E[F'v^+v^{+\prime}F] = A$ exists and is positive definite.

Then the Hayashi and Sims estimator

    $\hat\delta_{HS} = [Q^{+\prime}F(F'F)^{-1}F'Q^+]^{-1}Q^{+\prime}F(F'F)^{-1}F'Y^+$,

with known $V$, is consistent and $\sqrt{T}(\hat\delta_{HS} - \delta)$ converges
in distribution to a normal random variable with mean 0 and variance-
covariance matrix $\sigma^2\,\mathrm{plim}[(Q^{+\prime}F/T)(F'F/T)^{-1}(F'Q^+/T)]^{-1}$.

Proof. Consistency. $\hat\delta_{HS} = [Q^{+\prime}F(F'F)^{-1}F'Q^+]^{-1}Q^{+\prime}F(F'F)^{-1}F'Y^+$
implies that $\hat\delta_{HS} = \delta + [Q^{+\prime}F(F'F)^{-1}F'Q^+]^{-1}Q^{+\prime}F(F'F)^{-1}F'v^+$. This, in
turn, implies that

    $\mathrm{plim}\,\hat\delta_{HS} = \delta + \mathrm{plim}\{[(Q^{+\prime}F/T)(F'F/T)^{-1}(F'Q^+/T)]^{-1}(Q^{+\prime}F/T)(F'F/T)^{-1}(F'v^+/T)\}$.

By ergodicity, $\mathrm{plim}(F'v^+/T) = E(F_t'v_t^+)$ and by the identifying
assumption (25), $E(F_t'v_t^+) = 0$. Hence, if one uses this result and assumptions
(a) and (b), then one can show that $\mathrm{plim}\,\hat\delta_{HS} = \delta$.

Asymptotic Normality. Consider

    $\sqrt{T}(\hat\delta_{HS} - \delta) = [(Q^{+\prime}F/T)(F'F/T)^{-1}(F'Q^+/T)]^{-1}(Q^{+\prime}F/T)(F'F/T)^{-1}(F'v^+/\sqrt{T})$.

By a central limit theorem used by Hansen (1982), $(F'v^+/\sqrt{T}) \to N(0, A)$,
where $A = \lim (1/T)E[F'v^+v^{+\prime}F]$. By this result and assumptions (a),
(b), and (c),

    $\sqrt{T}(\hat\delta_{HS} - \delta) \to N(0, \sigma^2\,\mathrm{plim}[(Q^{+\prime}F/T)(F'F/T)^{-1}(F'Q^+/T)]^{-1})$.
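
As a rough illustration of the formula in Proposition 1, the following Python sketch computes the instrumental-variable estimator on data that are assumed to have been transformed by $\Gamma^{-1}$ already. The names `iv_estimator`, `Q_plus` and `Y_plus` are hypothetical and are not taken from the paper.

```python
import numpy as np

def iv_estimator(Q_plus, F, Y_plus):
    """Compute [Q'P_F Q]^{-1} Q'P_F Y with P_F = F (F'F)^{-1} F'.

    Q_plus, Y_plus: regressors and dependent variable after the
    transformation by Gamma^{-1} (assumed done elsewhere).
    F: matrix of instruments.
    """
    FtF = F.T @ F
    proj_Q = np.linalg.solve(FtF, F.T @ Q_plus)   # (F'F)^{-1} F'Q+
    proj_Y = np.linalg.solve(FtF, F.T @ Y_plus)   # (F'F)^{-1} F'Y+
    A = Q_plus.T @ F @ proj_Q                     # Q+'F (F'F)^{-1} F'Q+
    b = Q_plus.T @ F @ proj_Y                     # Q+'F (F'F)^{-1} F'Y+
    return np.linalg.solve(A, b)
```

When the transformed equation holds without error, the function returns the coefficient vector exactly, which gives a quick sanity check of an implementation.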

We now consider the case where $V$ is unknown and consequently has
to be estimated. In this case it is not sufficient for asymptotic equivalence
that $V$ be estimated consistently. Before we can make such a statement we
need to prove that: (i) $\mathrm{plim}\{[(\hat Q^+ - Q^+)'F]/T\} = 0$ and
(ii) $\mathrm{plim}\{[F'(\hat\Gamma^{-1} - \Gamma^{-1})]/\sqrt{T}\} = 0$, where $\hat Q^+$ is the
estimated $\Gamma^{-1}$ multiplied by $Q$ and $\hat\Gamma^{-1}$ is the estimate of $\Gamma^{-1}$.
Hayashi and Sims (1983) have proved that these conditions are satisfied for
their implementations of the $\hat\delta_{HS}$ estimator, i.e., $HS_1$ and $HS_2$,
under the following assumptions:

I.1 $Y_t$, $Q_t$ and $F_t$ are jointly stationary, ergodic, of finite variance, linearly
regular and of maximal rank.

I.2 $E[v_t \mid F_{t-s}] = 0$ for all $s > 0$.

I.3 $E[v_tv_s \mid F_{t-s}] = E[v_tv_s]$ for all $s > 0$.

I.4 For $Q_t^+ = a(L^{-1})Q_t$, $E[Q_t^{+\prime}F_t]$ is of full row rank.

I.5 The “backwards innovation” is independent of $F_t$.

II. $T^{1/2}\max_s\{|\hat a_s - a_s|/B(s)\}$ is bounded in probability, where $B(s)$ is
positive and absolutely summable.

We now consider our new estimators and the Fair (1984a) estimator. 

Proposition 2. The asymptotic properties of $HS_3$ are identical to those of
$HS_1$, i.e. $\mathrm{plim}\,\sqrt{T}(HS_3 - HS_1) = 0$; hence they are also identical to those
of $\hat\delta_{HS}$.

Proof. Compare the transformation matrix involved in $HS_3$ with that for
$HS_1$. This matrix has $s$ additional rows and, in addition, the zero elements
in the upper triangular part of the matrix are replaced by additional terms
as described above. As $T$ goes to infinity, we may, following standard argu-
ments, ignore the $s$ additional observations corresponding to the additional
$s$ rows. Similarly, we may argue that as $T$ goes to infinity, the additional
upper triangular non-zero terms tend to zero and may be ignored.

Proposition 3. The asymptotic properties of $HS_4$ are identical to those
for $HS_3$, i.e. $\mathrm{plim}\,\sqrt{T}(HS_4 - HS_3) = 0$; hence they are also identical to
those of $\hat\delta_{HS}$.

Proof. Compare estimators $HS_4$ and $HS_3$. The only difference between
them lies in the fact that $HS_4$ includes an artificial regressor in the
MA(1) example given in Section 4. As $T$ tends to infinity, the terms in this
artificial regressor tend to zero and hence asymptotically this regressor may
be ignored.

Proposition 4. The asymptotic properties of $HS_5$ are identical to those
for $HS_2$, i.e. $\mathrm{plim}\,\sqrt{T}(HS_5 - HS_2) = 0$; hence they are also identical to
those of $\hat\delta_{HS}$.

Proof. Compare the transformation matrices for $HS_5$ and $HS_2$. The only
difference is the additional $s$ rows. However, as $T$ tends to infinity, the $s$
observations corresponding to these additional $s$ rows may be ignored.

We now come to the “exact transformation” estimator and the Fair (1984a)
estimator. These estimators, it will be recalled, involve estimation of the
matrices $V$ and $\sigma^2V$ respectively. Given the fact that Hayashi and Sims
(1983) have shown that the two sufficient conditions (i) and (ii) are satisfied
for their implementations, it follows that it will be sufficient for us to show
the consistency of these estimators’ estimates of $V$ and $\sigma^2V$ respectively to
ensure that their asymptotic properties are equivalent to those of $\hat\delta_{HS}$.

In the case of the “exact transformation” estimator, the elements of the
$V$ matrix are functions of the moving average parameters, and therefore the
$V$ matrix will be estimated consistently as long as the moving average param-
eters are estimated consistently. The consistency of the moving average pa-
rameter estimates follows from the fact that we use Box-Jenkins procedures
on the consistently estimated residuals of (24), i.e., Box-Jenkins procedures
are either maximum likelihood or asymptotically equivalent to maximum
likelihood. Alternatively, in the case where we estimate the moving average
parameters using the method of Ullah and Power (1985), consistency may
be proved as an extension of the results in Ullah et al. (1985).

Finally, we consider the Fair (1984a) estimator, which estimates the $\sigma^2V$
matrix, slightly more formally.

Proposition 5. Assume (a) of Proposition 1 and also the following:

(d) $E[Q_t'F_t]$ is of full row rank.

(e) $E[Q_t'v_t]$ is positive definite.

Then the Fair (1984a) estimator of the $\sigma^2V$ matrix is consistent.

Proof. We consider the MA(1) case for simplicity. Recall that the distur-
bance term of equation (24) is denoted $v_t$, that the backward representation
of $v_t$ is $v_t = \epsilon_t - \theta\epsilon_{t-1}$, and that we have referred above to the consistently
estimated residuals from this equation as $\hat v_t$, where the consistent estimator
was 2SLS. The proof will be in four parts. Together they are sufficient.

(I.) We first prove: $\mathrm{plim}[(1/T)v'v] = (1 + \theta^2)\sigma^2$.

    $\mathrm{plim}[(1/T)v'v] = \mathrm{plim}[(1/T)\sum v_t^2]$
    $= \mathrm{plim}[(1/T)\sum\epsilon_t^2] + \theta^2\,\mathrm{plim}[(1/T)\sum\epsilon_{t-1}^2] - 2\theta\,\mathrm{plim}[(1/T)\sum\epsilon_t\epsilon_{t-1}]$
    $= (1 + \theta^2)\sigma^2$.

(II.) We next prove: $\mathrm{plim}[(1/(T-1))\sum v_tv_{t-1}] = -\theta\sigma^2$.

    $\mathrm{plim}[(1/(T-1))\sum v_tv_{t-1}] = \mathrm{plim}[(1/(T-1))\sum\epsilon_t\epsilon_{t-1}] - \theta\,\mathrm{plim}[(1/(T-1))\sum\epsilon_t\epsilon_{t-2}]$
    $\quad - \theta\,\mathrm{plim}[(1/(T-1))\sum\epsilon_{t-1}^2] + \theta^2\,\mathrm{plim}[(1/(T-1))\sum\epsilon_{t-1}\epsilon_{t-2}] = -\theta\sigma^2$.

(III.) We next prove: $\mathrm{plim}[(1/T)\hat v'\hat v] = \mathrm{plim}[(1/T)v'v]$.

If $M = [I - Q(Q'F(F'F)^{-1}F'Q)^{-1}Q'F(F'F)^{-1}F']$ then

    $T^{-1}\hat v'\hat v = T^{-1}v'Mv = T^{-1}v'v - (v'Q/T)[(Q'F/T)(F'F/T)^{-1}(F'Q/T)]^{-1}(Q'F/T)(F'F/T)^{-1}(F'v/T)$.

This implies that

    $\mathrm{plim}[(1/T)\hat v'\hat v] = \mathrm{plim}[(1/T)v'v] - \mathrm{plim}\{(v'Q/T)[(Q'F/T)(F'F/T)^{-1}(F'Q/T)]^{-1}(Q'F/T)(F'F/T)^{-1}(F'v/T)\}$.

Consider $\mathrm{plim}(F'v/T)$. By ergodicity $\mathrm{plim}(F'v/T) = E[F_t'v_t]$ and
by the identifying assumption, $E[F_t'v_t] = 0$. Therefore these results
and the assumptions (a), (d), and (e) imply that $\mathrm{plim}[(1/T)\hat v'\hat v] = \mathrm{plim}[(1/T)v'v]$.

(IV.) We finally prove: $\mathrm{plim}[(1/(T-1))\sum\hat v_t\hat v_{t-1}] = \mathrm{plim}[(1/(T-1))\sum v_tv_{t-1}]$.

Let $B$ be a $T \times T$ matrix whose first row consists of zeros
and whose $i$th row ($i \ge 2$) has a 1 in the $(i-1)$th position
and zeros elsewhere. Then

    $\mathrm{plim}[(1/(T-1))\sum\hat v_t\hat v_{t-1}] = \mathrm{plim}[(1/T)\hat v'B\hat v] = \mathrm{plim}[(1/T)v'M'BMv]$
    $= \mathrm{plim}[(1/T)v'Bv] = \mathrm{plim}[(1/(T-1))\sum v_tv_{t-1}]$,

where the third equality follows by arguments similar to those used in (III).
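
The lag matrix $B$ used in part (IV) and the MA(1) moments of parts (I) and (II) can be checked numerically. The following Python sketch is only an illustration; the function names are hypothetical, and the MA(1) form $v_t = \epsilon_t - \theta\epsilon_{t-1}$ is the one used in the proof above.

```python
import numpy as np

def lag_matrix(T):
    """T x T matrix B with B[i, i-1] = 1: zero first row, ones on the subdiagonal."""
    return np.eye(T, k=-1)

def first_order_cross_product(v):
    """v'Bv, which equals the sum over t of v_t * v_{t-1}."""
    v = np.asarray(v, dtype=float)
    return v @ lag_matrix(v.size) @ v

def ma1_moments(theta, sigma2):
    """Population variance and first autocovariance of v_t = e_t - theta * e_{t-1}."""
    return (1.0 + theta**2) * sigma2, -theta * sigma2
```

For example, `first_order_cross_product([1, 2, 3, 4])` sums the products 2·1, 3·2 and 4·3, and `ma1_moments(theta, sigma2)` returns the two limits that the Fair (1984a) tri-diagonal estimate is meant to recover.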

LINEAR WALD METHODS FOR INFERENCE 
ON COVARIANCES AND WEAK EXOGENEITY 
TESTS IN STRUCTURAL EQUATIONS 

ABSTRACT 

Inference about the vector of covariances between the stochastic explana- 
tory variables and the disturbance term of a structural equation is an im- 
portant problem in econometrics. For example, one may wish to test the 
independence between stochastic explanatory variables and the disturbance 
term. Tests for the hypothesis of independence between the full vector of 
stochastic explanatory variables and the disturbance have been proposed 
by several authors. When more than one stochastic explanatory variable is 
involved, it can be of interest to determine whether all of them are inde- 
pendent of the disturbance and, if not, which ones are. We develop simple 
large-sample methods which allow us to construct confidence regions and 
test hypotheses concerning any vector of linear transformations of the co- 
variances between the stochastic explanatory variables and the disturbance 
of a structural equation. The main method described is a generalized Wald 
procedure which simply requires two linear regressions. No nonlinear estima- 
tion is needed. Consistent tests for weak exogeneity hypotheses are derived 
as special cases. 



1. INTRODUCTION 

Inference about the vector of covariances between the stochastic explana- 
tory variables and the disturbance term of a structural equation is an impor- 
tant problem in econometrics. For example, one may wish to test whether 
a set of stochastic explanatory variables are statistically independent of the
disturbance of a structural equation, i.e., whether the stochastic explanatory 
variables considered can be treated as “exogenous” (or “predetermined”). 
In particular, it is well known that independence between explanatory vari- 
ables and disturbances is usually needed to ensure that standard inference 
procedures, like ordinary least squares or F-tests, are appropriate in linear 
models. Furthermore, a number of economic hypotheses can be formulated 
in terms of the independence between stochastic explanatory variables and 
disturbances.¹

Tests for the hypothesis of independence between a vector of stochas- 
tic explanatory variables and a disturbance term were proposed by sev- 
eral authors; see Durbin (1954), Wu (1973, 1974), Revankar and Hartley 
(1973), Farebrother (1976), Hausman (1978), Revankar (1978), Kariya and 
Hodoshima (1980), Richard (1980), and Holly and Sargan (1982).² These
articles deal especially with the problem of testing whether the full vector
of stochastic explanatory variables is independent of the disturbance. When
more than one stochastic explanatory variable is involved, it is often
necessary to determine whether all of them are independent of the disturbances
and, if not, which ones are. This can be useful, for example, to check the 
specification of a simultaneous equation model (e.g., block recursiveness as- 
sumptions) and to get more efficient estimators for such models. 

Tests for the hypothesis of independence between a subset of stochastic 
explanatory variables and the disturbance in a structural equation have been 
proposed by a number of authors: Hwang (1980) and Smith (1984) studied 
likelihood ratio (LR) tests, Hausman and Taylor (1981a), Spencer and Berk 
(1981) and Wu (1983b) proposed extensions of the tests previously studied 
by Wu (1973) and by Hausman (1978), while Engle (1982) derived Lagrange 
multiplier (LM) tests. 

Each of these procedures has important drawbacks, either practical or 
theoretical. Some of them require nonlinear estimation, e.g., LR tests and 
certain forms of the LM tests. All of them require a separate estimation for 
each null hypothesis tested. It is difficult to construct confidence intervals 
for the covariances of interest because covariance estimates or their standard 
errors are not typically produced. 

¹ See Wu (1973). For an example of a structural equation where the stochastic
explanatory variables can be treated as “exogenous”, see Zellner et al. (1966).

² Further useful discussions and extensions of these tests are provided by
Bronsard and Salvas-Bronsard (1984), Engle (1982, 1984), Gouriéroux and Trognon
(1984), Hausman and Taylor (1980, 1981a, b), Holly (1980, 1982a, b, 1983), Holly
and Monfort (1982), Nakamura and Nakamura (1980, 1981), Plosser et al. (1982),
Reynolds (1982), Riess (1983), Ruud (1984), Tsurumi and Shiba (1984),
Turkington (1980), White (1982), and Wu (1983a).

Hausman-type tests are better viewed as consistency tests. By com-
paring an efficient estimator under the null hypothesis with a consistent
estimator under the alternative hypothesis, one checks whether the con-
strained estimator is consistent (see Holly, 1982a,b; Hausman and Taylor,
1980, 1981b). When testing exogeneity, this is not equivalent to testing inde-
pendence between possibly endogenous variables and the disturbance term:
the condition tested is weaker (unless special assumptions hold) and the test
may not be consistent. (This is easy to see from Hausman and Taylor (1980,
1981b) and Wu (1983b).) Even though this condition may be sufficient to
ensure the consistency or the efficiency of the constrained estimator, it is
not generally sufficient to guarantee the validity of inferences obtained from
the model by treating the regressors whose exogeneity is in doubt as being
exogenous: tests and confidence intervals pertaining to the various coeffi-
cients of the model may not have the correct levels, even asymptotically.³
In many if not most practical situations, the relevant hypothesis is whether
one can treat some stochastic explanatory variables as being exogenous for
all purposes of inference (i.e., the independence assumption).

In this paper, we consider a single linear structural equation and develop 
a class of linear Wald-type procedures which allow us to construct confidence 
regions as well as to test any set of linear restrictions on the vector of covari- 
ances between the stochastic explanatory variables and the disturbance term 
in the equation. Besides a set of instrumental regressions, all that is needed 
is a simple linear regression which yields consistent estimates of both the 
structural coefficients in the equation and the relevant vector of covariances. 
The asymptotic covariance matrix of the coefficients is then easily obtained. 
Using these results, one can test any set of linear restrictions on the co- 
variances and construct confidence regions. Cross-restrictions between the 
structural coefficients and the covariances may also be tested. Special cases 
of this family of tests include tests of zero restrictions on the covariances, 
either for individual covariances or subvectors of covariances. In particular, 
one can compute in a routine way asymptotic “t-values” for each covariance, 
an especially convenient instrument to explore the recursiveness properties 
of a model. All the tests suggested are consistent. 

Because they are based on consistent asymptotically normal estimators 
different from the maximum-likelihood estimators (Wald, 1943), the tests 
developed here should be viewed as generalized Wald tests rather than Wald 
tests in the usual sense (see Stroud, 1971; Szroeter, 1983). We will not need 
the information matrix associated with the maximum likelihood estimators. 
As we shall see below, the tests proposed can be obtained as a byproduct of
the estimation of a structural equation by any instrumental-variable method
(including two-stage least squares). They thus have a natural complemen-
tarity with the latter estimation method.

³ See White (1982, p. 16), and Breusch and Mizon (comment to Ruud, 1984, p.
249).

In Section 2, we formulate the model considered and the assumptions 
used. In Section 3, we describe the procedures proposed and formulate 
the theorems underlying them. In Section 4, we examine three important 
special situations: the case where we want to test independence between the 
full vector of stochastic explanatory variables and the disturbance term, the 
one where a subset of stochastic explanatory variables is taken a priori as 
being exogenous and the case where the matrix of instruments includes all 
the fixed (or exogenous) regressors in the equation considered. In Section 5, 
we discuss econometric applications. Finally, in Section 6, we provide the 
proofs of the theorems. 



2. FRAMEWORK 



We consider the model described by the following assumptions.

ASSUMPTION 1:

    $y = Y\beta + Z_1\gamma + u$,    (2.1)

where $y$ is a $T \times 1$ random vector, $u$ is a $T \times 1$ vector of disturbances, $Y$
is a $T \times G$ matrix of stochastic explanatory variables, $Z_1$ is a $T \times K_1$ non-
stochastic matrix of rank $K_1$, and $\beta$ and $\gamma$ are $G \times 1$ and $K_1 \times 1$ vectors of
coefficients.

ASSUMPTION 2:

    $Y = Z\Pi + V$,    (2.2)

where $Z$ is a $T \times K$ non-stochastic matrix of rank $K$, $\Pi$ is a $K \times G$ matrix
of coefficients and $V$ is a $T \times G$ matrix of disturbances. Furthermore, we
will denote by $y_k$, $\Pi_k$ and $v_k$ the $k$th columns of the matrices $Y$, $\Pi$ and $V$
respectively ($1 \le k \le G$):

    $Y = [y_1, \ldots, y_G]$, $\Pi = [\Pi_1, \ldots, \Pi_G]$, $V = [v_1, \ldots, v_G]$.    (2.3)


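ASSUMPTION 3: The rows $(u_t, v_t')$, $t = 1, \ldots, T$, of the matrix $[u : V]$
are independent and normally distributed with mean zero and non-singular
covariance matrix

    $\Sigma = \begin{bmatrix} \sigma_{00} & \delta' \\ \delta & \Sigma_{22} \end{bmatrix}$    (2.4)

where

    $\delta = (\sigma_{01}, \ldots, \sigma_{0G})'$, $\Sigma_{22} = [\sigma_{jk}]_{j,k=1,\ldots,G}$.    (2.5)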

ASSUMPTION 4: Let $Z = [Z_{11} : Z_2]$ and $\Pi = [\Pi_{11}' : \Pi_2']'$, where $Z_2$
is the $T \times K_2$ matrix of non-stochastic variables excluded from equation
(2.1), $\Pi_2$ is the $K_2 \times G$ corresponding matrix of coefficients, $Z_{11}$ is a set of
variables included in $Z_1$, so that $Y = Z_{11}\Pi_{11} + Z_2\Pi_2 + V$, rank$(\Pi_2) = G$
and $T > 2G + K_1$.

[This condition ensures identification of the coefficients of equation (2.1); see
Fisher (1966, p. 53). Note also that $Z_1$ is not constrained to be a submatrix
of $Z$.]

ASSUMPTION 5: The matrix $\frac{1}{T}Z'Z$ converges, as $T \to \infty$, to a positive
definite matrix $Q_z$.

ASSUMPTION 6: The matrices $\frac{1}{T}Z_1'Z_1$ and $\frac{1}{T}Z'Z_1$ converge, as $T \to \infty$,
to the matrices $Q_{11}$ and $Q_1$ respectively, where $Q_{11}$ is positive definite.

We want to test some set of linear restrictions on the parameter vector
$\delta$, i.e., a hypothesis of the type

    $H_0 : H\delta = d_0$,    (2.6)

where $H$ is an $r \times G$ matrix of rank $r \le G$ and $d_0$ is a fixed $r \times 1$ vector. Since
the vectors $(u_t, v_t')'$, $t = 1, \ldots, T$, are i.i.d. normal, we obtain by regressing
$u_t$ on $v_t$:

    $u = Va + e$,    (2.7)

where $a = \Sigma_{22}^{-1}\delta$⁴ and the vector $e$ is $N[0, \sigma_e^2 I_T]$, independent of all the
elements of $V$. Then, substituting (2.7) into (2.1), we get

    $y = Y\beta + Z_1\gamma + Va + e$,    (2.8)

where the disturbance vector $e$ is independent of all the regressors. The lat-
ter formulation illustrates clearly that the existence of correlation between
some of the regressors and the disturbance term in an econometric relation-
ship, as generated, for example, by simultaneous equations, may be viewed
as a problem of omitted variables. If the matrix $V$ were observed, we would
test any set of linear restrictions on the coefficients $\beta$, $\gamma$ and $a$ in equation
(2.8) by standard F-tests, and these tests would be exact in small samples.
In particular, linear hypotheses regarding the parameter vector $a$ could be
tested by using the least squares estimate $\hat a$ obtained from (2.8). Further-
more, if $\Sigma_{22}$ were also known, the transformation $\delta = \Sigma_{22}a$ would allow
one to test $H_0 : H\delta = d_0$ by a standard F-test. The difficulty, of course, is
that neither $V$ nor $\Sigma_{22}$ is known. We also note that, although hypothe-
ses regarding $\delta$ have relatively direct and intuitive interpretations (e.g., in
terms of independence), the auxiliary parameter vector $a = \Sigma_{22}^{-1}\delta$ itself may
be of interest. One may wish to test linear restrictions on $a$ directly in the
reparameterized model (2.8). In any event, we will deal with both problems.

⁴ This transformation is also used by Revankar (1973), Revankar and Hartley
(1973) and Reynolds (1982).

We will first consider the problem of testing arbitrary linear restrictions
on the parameter vector $\alpha = (\beta', \gamma', a')'$ and then restrictions on the covari-
ance vector $\delta$. In each case, we will first define a vector of linear consistent
asymptotically normal estimators, derive the asymptotic covariance matrix
and propose generalized Wald tests. In particular, we will derive the asymp-
totic distribution of the covariance estimator $\hat\delta$ under both the null and the
alternative hypotheses. As a special case, it will be straightforward to test
zero restrictions on $\delta$, for example, $H_0 : \delta_1 = 0$ where $\delta = (\delta_1', \delta_2')'$. In the
context of the model considered here, the hypothesis $\delta_1 = 0$ is equivalent
to the independence between $Y_1$ and $u$, where $Y = [Y_1 : Y_2]$, or the weak
exogeneity of $Y_1$ inside equation (2.1).⁵ Further, from the same results, it is
easy to construct a confidence region for any element or subvector of $\delta$ or $a$.

3. DESCRIPTION OF THE TESTS 

In equation (2.8), replace the disturbance matrix $V$ by the corresponding
ordinary least squares (OLS) residuals

    $\hat V = Y - Z\hat\Pi$,    (3.1)

where $\hat\Pi = (Z'Z)^{-1}Z'Y$. We obtain in this way the equivalent equation

    $y = Y\beta + Z_1\gamma + \hat Va + e^* = X\alpha + e^*$,    (3.2)

where $X = [Y : Z_1 : \hat V]$, $\alpha = (\beta', \gamma', a')'$ and

    $e^* = Z(\hat\Pi - \Pi)a + e$.    (3.3)

Also let

    $\hat\Sigma_{22} = \hat V'\hat V/T$.    (3.4)

Under Assumptions 2 through 6, we have

    $\mathrm{plim}\,\frac{Z'V}{T} = 0$, $\mathrm{plim}\,\hat\Sigma_{22} = \Sigma_{22}$    (3.5)

(where plim refers to the probability limit as $T \to \infty$), hence

    $Q_x = \mathrm{plim}\,\frac{X'X}{T} = \begin{bmatrix} \Pi'Q_z\Pi + \Sigma_{22} & \Pi'Q_1 & \Sigma_{22} \\ Q_1'\Pi & Q_{11} & 0 \\ \Sigma_{22} & 0 & \Sigma_{22} \end{bmatrix}$    (3.6)

and

    $Q_{zx} = \mathrm{plim}\,\frac{Z'X}{T} = [Q_z\Pi : Q_1 : 0]$,    (3.7)

where rank$(Q_x) = L = 2G + K_1$. Consider the OLS estimate of $\alpha$ obtained
from (3.2):

    $\hat\alpha = (X'X)^{-1}X'y$.    (3.8)

Under the assumptions made, this estimate is unique with probability one.
Further, the asymptotic distribution of $\hat\alpha$ is given by the following theorem.
(The proofs of the theorems are given in Section 6.)

⁵ For a general discussion of exogeneity and related notions, see Engle et al.
(1983).



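Theorem 1. Suppose that Assumptions 1 through 6 are satisfied, and let
matrix $Q_x$ defined in (3.6) be non-singular. Then the estimator $\hat\alpha$ given in
(3.8) is consistent for $\alpha$ and $\sqrt{T}(\hat\alpha - \alpha)$ has a normal limiting distribution
with mean zero and covariance matrix

    $\Sigma_\alpha = Q_x^{-1}[\sigma_e^2Q_x + \rho\,Q_{zx}'Q_z^{-1}Q_{zx}]Q_x^{-1}$    (3.9)

where $Q_{zx}$ is given by (3.7) and

    $\rho = a'\Sigma_{22}a = \delta'\Sigma_{22}^{-1}\delta$.    (3.10)

Further, the statistics

    $\hat\sigma_e^2 = (y - X\hat\alpha)'(y - X\hat\alpha)/T$    (3.11)

and

    $\hat\Sigma_\alpha = \left(\frac{X'X}{T}\right)^{-1}\left[\hat\sigma_e^2\left(\frac{X'X}{T}\right) + \hat\rho\left(\frac{X'Z}{T}\right)\left(\frac{Z'Z}{T}\right)^{-1}\left(\frac{Z'X}{T}\right)\right]\left(\frac{X'X}{T}\right)^{-1}$    (3.12)

are consistent estimators of $\sigma_e^2$ and $\Sigma_\alpha$, where $\hat\rho = \hat a'\hat\Sigma_{22}\hat a$, $\hat\Sigma_{22} = \hat V'\hat V/T$
and $\hat a$ is the estimate of $a$ from $\hat\alpha$.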
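
The two linear regressions behind (3.1), (3.8), (3.11) and (3.12) can be sketched in Python as follows. This is a minimal illustration, not the author's code: the function name `linear_wald_fit` and its argument layout are hypothetical, `y` is assumed to be a one-dimensional array, and `Y`, `Z1`, `Z` are assumed to be two-dimensional arrays with $T$ rows.

```python
import numpy as np

def linear_wald_fit(y, Y, Z1, Z):
    """Return alpha_hat, Sigma_alpha_hat, Sigma22_hat, sigma2_e_hat, rho_hat."""
    T, G = Y.shape
    # First regression: instrumental (reduced-form) regression of Y on Z.
    Pi_hat = np.linalg.lstsq(Z, Y, rcond=None)[0]      # (Z'Z)^{-1} Z'Y
    V_hat = Y - Z @ Pi_hat                             # residuals, eq. (3.1)
    # Second regression: y on X = [Y : Z1 : V_hat], eq. (3.2) and (3.8).
    X = np.hstack([Y, Z1, V_hat])
    alpha_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    a_hat = alpha_hat[-G:]                             # coefficients on V_hat
    resid = y - X @ alpha_hat
    sigma2_e = resid @ resid / T                       # eq. (3.11)
    Sigma22 = V_hat.T @ V_hat / T                      # eq. (3.4)
    rho = a_hat @ Sigma22 @ a_hat
    XX, XZ, ZZ = X.T @ X / T, X.T @ Z / T, Z.T @ Z / T
    middle = sigma2_e * XX + rho * XZ @ np.linalg.solve(ZZ, XZ.T)
    XX_inv = np.linalg.inv(XX)
    Sigma_alpha = XX_inv @ middle @ XX_inv             # eq. (3.12)
    return alpha_hat, Sigma_alpha, Sigma22, sigma2_e, rho
```

The last $G$ entries of `alpha_hat` are $\hat a$, and the lower-right $G \times G$ block of `Sigma_alpha` is the corresponding covariance estimate.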

We can test any set of linear restrictions on the vector $\alpha$, such as $M\alpha = m_0$,
where $M$ is a $\nu \times L$ matrix of rank $\nu \le L$ and $m_0$ is a $\nu \times 1$ fixed vector,
by using a critical region of the form $\{S(M, m_0) > c\}$, where

    $S(M, m_0) = T(M\hat\alpha - m_0)'(M\hat\Sigma_\alpha M')^{-1}(M\hat\alpha - m_0)$    (3.13)

and $c$ is a constant which depends on the level of the test. The asymptotic
distribution of the test statistic $S(M, m_0)$ is chi-square with $\nu$ degrees of
freedom under the null hypothesis.
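
A statistic of the form (3.13) is a quadratic form and is short to compute. The Python sketch below is illustrative only; the function name `wald_statistic` and its generic argument names are hypothetical. The same function applies to any estimator whose $\sqrt{T}$-scaled asymptotic covariance estimate is supplied.

```python
import numpy as np

def wald_statistic(theta_hat, Sigma_hat, R, r0, T):
    """T * (R theta_hat - r0)' (R Sigma_hat R')^{-1} (R theta_hat - r0).

    theta_hat: estimated coefficient vector.
    Sigma_hat: estimate of the covariance of sqrt(T) * (theta_hat - theta).
    R, r0:     restriction matrix and right-hand side.
    T:         sample size.
    """
    diff = R @ theta_hat - r0
    return T * diff @ np.linalg.solve(R @ Sigma_hat @ R.T, diff)
```

The value is then compared with a chi-square critical value whose degrees of freedom equal the number of rows of `R`; for a single restriction at the 5 percent level that critical value is about 3.84.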

Since the coefficient $a$ is of special interest here, it will be useful to
summarize the asymptotic properties of $\hat a$ by the following corollary.

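Corollary 1.1. Under the assumptions of Theorem 1, the subvector $\hat a$ of
$\hat\alpha = (\hat\beta', \hat\gamma', \hat a')'$ is a consistent estimator of $a$ and $\sqrt{T}(\hat a - a)$ has a normal
limiting distribution with mean zero and covariance matrix:

    $\Sigma_a = A_2[\sigma_e^2Q_x + \rho\,Q_{zx}'Q_z^{-1}Q_{zx}]A_2'$,    (3.14)

where $A_2 = \mathrm{plim}(C_2)$ and $C_2$ is the $G \times (2G + K_1)$ matrix such that

    $\left(\frac{X'X}{T}\right)^{-1} = \begin{bmatrix} C_1 \\ C_2 \end{bmatrix}$.    (3.15)

Further, the submatrix

    $\hat\Sigma_a = C_2\left[\hat\sigma_e^2\left(\frac{X'X}{T}\right) + \hat\rho\left(\frac{X'Z}{T}\right)\left(\frac{Z'Z}{T}\right)^{-1}\left(\frac{Z'X}{T}\right)\right]C_2'$    (3.16)

of $\hat\Sigma_\alpha$ in (3.12) is a consistent estimator of $\Sigma_a$.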

Of course, tests of linear restrictions on $a$ are special cases of the tests
given by (3.13). However, if our interest lies in $\delta$ rather than $a = \Sigma_{22}^{-1}\delta$, the
estimator directly relevant to us is not $\hat a$. We need an estimator of $\delta$. Since
$\hat a$ and $\hat\Sigma_{22}$ are consistent estimators of $a$ and $\Sigma_{22}$, $\hat\delta = \hat\Sigma_{22}\hat a$ is a consistent
estimator of $\delta$. The asymptotic distribution of $\hat\delta$ is given by the following
theorem.



Theorem 2. Under the assumptions of Theorem 1, the estimator $\hat\delta = \hat\Sigma_{22}\hat a$
is consistent for $\delta$ and the vector $\sqrt{T}(\hat\delta - \delta)$ has a normal limiting distribution,
as $T \to \infty$, with mean zero and covariance matrix

    $\Sigma_\delta = \Sigma_{22}\Sigma_a\Sigma_{22} + \rho\,\Sigma_{22} + \delta\delta'$    (3.17)

where $\Sigma_a$ is given by (3.14). Further, a consistent estimator of $\Sigma_\delta$ is provided
by

    $\hat\Sigma_\delta = \hat\Sigma_{22}\hat\Sigma_a\hat\Sigma_{22} + \hat\rho\,\hat\Sigma_{22} + \hat\delta\hat\delta'$    (3.18)

where $\hat\rho$ and $\hat\Sigma_{22}$ are defined in Theorem 1.

Consequently, we can test the hypothesis $H_0 : H\delta = d_0$, where $H$ is an
$r \times G$ matrix of rank $r \le G$ and $d_0$ is a fixed $r \times 1$ vector, by using a critical
region of the form $\{W(H, d_0) > c\}$, where

    $W(H, d_0) = T(H\hat\delta - d_0)'(H\hat\Sigma_\delta H')^{-1}(H\hat\delta - d_0)$    (3.19)

and $c$ depends on the level of the test. The asymptotic distribution of the
statistic $W(H, d_0)$, under $H_0$, is chi-square with $r$ degrees of freedom. Again,
this test is valid for large samples.
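
A covariance estimate and its asymptotic "t-values" can be sketched in Python as below. This is an illustration under an explicit assumption: the last term of the covariance estimator is garbled in the available copy of (3.17) and (3.18), and the sketch takes it to be $\hat\delta\hat\delta'$, which is what a direct derivation under normality gives. The function names are hypothetical.

```python
import numpy as np

def delta_estimate(a_hat, Sigma_a_hat, Sigma22_hat):
    """delta_hat = Sigma22_hat a_hat and an estimate of its asymptotic covariance.

    ASSUMPTION: covariance taken as
    Sigma22 Sigma_a Sigma22 + rho * Sigma22 + delta delta'.
    """
    delta_hat = Sigma22_hat @ a_hat
    rho_hat = a_hat @ Sigma22_hat @ a_hat
    Sigma_delta_hat = (Sigma22_hat @ Sigma_a_hat @ Sigma22_hat
                       + rho_hat * Sigma22_hat
                       + np.outer(delta_hat, delta_hat))
    return delta_hat, Sigma_delta_hat

def asymptotic_t_values(delta_hat, Sigma_delta_hat, T):
    """sqrt(T) * delta_k / sqrt(k-th diagonal element), one value per covariance."""
    return np.sqrt(T) * delta_hat / np.sqrt(np.diag(Sigma_delta_hat))
```

Each t-value is the square root of the statistic $W(H, d_0)$ obtained with $H$ equal to a single row of the identity matrix and $d_0 = 0$, up to sign.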

Concerning the power of the above tests, we can make the important
observation that they are consistent whenever $M\alpha \ne m_0$ or $H\delta \ne d_0$ (see
Section 6.5).⁶ Besides, by considering complements of the critical regions
described above, we can obtain confidence regions for $M\alpha$ or $H\delta$, for example
confidence intervals for the individual covariances in $\delta$.



4. SPECIAL CASES 

We will now examine three cases of special interest. First, consider the
situation where the null hypothesis is $H_0 : \delta = 0$ or equivalently, $H_0 : a = 0$.
Under $H_0$, we can rewrite equation (3.2) as

    $y = Y\beta + Z_1\gamma + \hat Va + e$,    (4.1)

where $e$ follows a $N[0, \sigma_e^2I_T]$ distribution and is independent of both $Y$ and
$\hat V$. Then the standard F-statistic for testing $a = 0$ is

    $F = \dfrac{\hat a'(\hat V'M_1\hat V)\hat a/G}{\hat e'\hat e/(T - K_1 - 2G)}$,    (4.2)

where $M_1 = I_T - X_1(X_1'X_1)^{-1}X_1'$ and $X_1 = [Y : Z_1]$; under $H_0$, $F$ follows a
Fisher distribution with $(G, T - K_1 - 2G)$ degrees of freedom. The resulting
test is exact rather than asymptotic.⁷ It is not equivalent (even asymptot-
ically) to the test of $a = 0$ based on the statistic $S_0 = T\hat a'\hat\Sigma_a^{-1}\hat a$, obtained
from (3.13). The main difference is that $\hat\rho$ is set to zero in the estimator of
$\Sigma_a$ in (3.16). This restriction is justified under $H_0$, for then $\rho = a'\Sigma_{22}a = 0$.
If we write $F$ in the form

    $F = T\hat a'\tilde\Sigma_a^{-1}\hat a/G$,    (4.3)

where

    $\tilde\Sigma_a = \tilde\sigma_e^2(\hat V'M_1\hat V/T)^{-1}$, $\tilde\sigma_e^2 = \hat e'\hat e/(T - K_1 - 2G)$,    (4.4)

we see easily that the statistics $F$ and $S_0/G$ are asymptotically identical
under $H_0$ (since $\hat\rho \to 0$). Nevertheless, under the alternative, this equivalence
does not hold because $\hat\rho$ does not, in general, converge to zero.

⁶ This property is especially important in view of Holly’s (1982a) discussion of
Hausman-type tests.

⁷ One can see easily that this test is equivalent to a test suggested by Wu (1973,
$T_2$ statistic) and, in a different form, by Hausman (1978, eq. 2.23), except that
$Z_1$ is not necessarily a submatrix of $Z$. On alternative forms of the Wu-Hausman
test, see Nakamura and Nakamura (1981).
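
The exact F-statistic for $a = 0$ can be sketched in Python as follows. This is an illustration only, with hypothetical function names; it uses the quadratic form in $\hat a$ with $M_1\hat V$ computed as the residual of $\hat V$ on $[Y : Z_1]$, and `y` is assumed to be a one-dimensional array.

```python
import numpy as np

def ols(X, y):
    """Least squares coefficients and residuals of y on X."""
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    return coef, y - X @ coef

def exogeneity_f_statistic(y, Y, Z1, V_hat):
    """F-statistic for a = 0 in y = Y b + Z1 g + V_hat a + e."""
    T, G = Y.shape
    K1 = Z1.shape[1]
    X1 = np.hstack([Y, Z1])
    coef, resid = ols(np.hstack([X1, V_hat]), y)
    a_hat = coef[-G:]
    # M1 V_hat: the part of V_hat orthogonal to the columns of [Y : Z1].
    M1V = V_hat - X1 @ np.linalg.lstsq(X1, V_hat, rcond=None)[0]
    numerator = a_hat @ (M1V.T @ M1V) @ a_hat / G
    denominator = resid @ resid / (T - K1 - 2 * G)
    return numerator / denominator
```

By the usual partitioned-regression algebra, the numerator equals the drop in the residual sum of squares when $\hat V$ is added to the regression of $y$ on $[Y : Z_1]$, divided by $G$, so the statistic can also be obtained from two ordinary regressions.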

The second problem we wish to examine is to test whether a subset of
the variables in $Y$ are independent of $u$, conditional on the assumption that
the others are independent of $u$. More precisely, given $Y = [Y_1 : Y_2]$, we
want to test whether $Y_1$ and $u$ are independent, knowing that $Y_2$ and $u$
are independent. To do this, we can simply include $Y_2$ in $Z_1$ and reshape
equation (2.1) accordingly:

    $y = Y_1\beta_1 + Z_3\gamma_3 + u$,    (4.5)

where $Z_3 = [Y_2 : Z_1]$, $\gamma_3 = (\beta_2', \gamma')'$ and $\beta = (\beta_1', \beta_2')'$ is the partition of $\beta$
corresponding to $[Y_1 : Y_2]$. We then proceed as previously on the transformed
model.

Finally, consider the important case where the matrix $Z_1$ is a submatrix
of $Z$, say $Z = [Z_1 : Z_2]$. This is probably the most frequent situation
when (2.1) is viewed as a “structural equation” (presumably inside some
system of equations) and (2.2) represents the “reduced-form equation” for
the endogenous variables appearing on the right-hand side of (2.1). In this
case, the estimates $\hat\beta$ and $\hat\gamma$, obtained from the regression given by (3.2), are
the two-stage least squares (2SLS) estimates of $\beta$ and $\gamma$. To see this, rewrite
equation (3.2) as

    $y = \hat Y\beta + Z_1\gamma + \hat Va^* + e^*$,    (4.6)

where $\hat Y = Z\hat\Pi$ and $a^* = a + \beta$. By the orthogonality relations $\hat V'\hat Y = 0$ and
$\hat V'Z_1 = 0$, the estimates of $\beta$ and $\gamma$ obtained by OLS from (4.6) are identical
to those obtained from the regression

    $y = \hat Y\beta + Z_1\gamma + e^{**}$.    (4.7)

They are thus identical to the 2SLS estimates of $\beta$ and $\gamma$, showing clearly
that the linear Wald tests described above have a natural complementarity
with the estimation of a structural equation by 2SLS.
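
The equivalence with 2SLS is easy to verify numerically. The Python sketch below is a minimal check under the assumption that $Z_1$ is a submatrix of $Z$; the function names are hypothetical and `y` is assumed to be a one-dimensional array.

```python
import numpy as np

def augmented_regression(y, Y, Z1, Z):
    """OLS of y on [Y : Z1 : V_hat], where V_hat are residuals of Y on Z."""
    Pi_hat = np.linalg.lstsq(Z, Y, rcond=None)[0]
    V_hat = Y - Z @ Pi_hat
    X = np.hstack([Y, Z1, V_hat])
    return np.linalg.lstsq(X, y, rcond=None)[0]

def two_stage_least_squares(y, Y, Z1, Z):
    """OLS of y on [Y_fit : Z1], where Y_fit are fitted values of Y on Z."""
    Y_fit = Z @ np.linalg.lstsq(Z, Y, rcond=None)[0]
    X1 = np.hstack([Y_fit, Z1])
    return np.linalg.lstsq(X1, y, rcond=None)[0]
```

The first $G + K_1$ entries returned by `augmented_regression` should coincide, up to rounding, with the vector returned by `two_stage_least_squares`.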

In the same special case, the estimate $\hat\delta$ used in Theorem 2 may be
derived in a second interesting manner. Using again the orthogonality rela-
tions, we see that

    $\hat a^* = (\hat V'\hat V)^{-1}\hat V'y$;

hence

    $\hat a = \hat a^* - \hat\beta$

and

    $\hat\delta = \hat\Sigma_{22}\hat a = \frac{1}{T}\hat V'y - \hat\Sigma_{22}\hat\beta$.

Further, substitute (2.2) into (2.1) to get the reduced-form equation for $y$:

    $y = Z\Pi\beta + Z_1\gamma + v_0$,

where $v_0 = V\beta + u$. If we denote the $t$th element of $v_0$ by $v_{0t} = v_t'\beta + u_t$
and define $\omega_0 = E[v_tv_{0t}]$, we have

    $\delta = \omega_0 - \Sigma_{22}\beta$.

Since $\omega_0$ can be consistently estimated by $\hat\omega_0 = \frac{1}{T}\hat V'\hat v_0$, where $\hat v_0$ is the vector
of residuals from the regression of $y$ on $Z$, this suggests the following estimate
of $\delta$:

    $\tilde\delta = \hat\omega_0 - \hat\Sigma_{22}\tilde\beta$,

where $\tilde\beta$ is a consistent estimate of $\beta$. Then, if we take $\tilde\beta = \hat\beta$, the 2SLS
estimate of $\beta$, we see that

    $\tilde\delta = \frac{1}{T}\hat V'y - \hat\Sigma_{22}\hat\beta = \hat\delta$,    (4.8)

which shows that the estimator $\hat\delta$ can be generated in a second natural
manner.
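
The two routes to the covariance estimate can be compared numerically. The Python sketch below is illustrative only, with a hypothetical function name; it assumes that $Z_1$ is a submatrix of $Z$ and that `y` is a one-dimensional array.

```python
import numpy as np

def delta_two_ways(y, Y, Z1, Z):
    """Covariance estimate from the augmented regression and from 2SLS residual moments."""
    T, G = Y.shape
    Pi_hat = np.linalg.lstsq(Z, Y, rcond=None)[0]
    Y_fit = Z @ Pi_hat
    V_hat = Y - Y_fit
    Sigma22 = V_hat.T @ V_hat / T
    # Route 1: delta_hat = Sigma22_hat a_hat, with a_hat from y on [Y : Z1 : V_hat].
    alpha_hat = np.linalg.lstsq(np.hstack([Y, Z1, V_hat]), y, rcond=None)[0]
    delta_direct = Sigma22 @ alpha_hat[-G:]
    # Route 2: (1/T) V_hat'y - Sigma22_hat beta_hat, with beta_hat from 2SLS.
    beta_2sls = np.linalg.lstsq(np.hstack([Y_fit, Z1]), y, rcond=None)[0][:G]
    delta_reduced = V_hat.T @ y / T - Sigma22 @ beta_2sls
    return delta_direct, delta_reduced
```

Both returned vectors should agree up to rounding error, which mirrors the identity stated in (4.8).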



5. ECONOMETRIC APPLICATIONS 

As previously indicated, assumptions concerning the independence of 
various stochastic explanatory variables in a structural equation and the dis- 
turbance term can have important implications for the appropriate choice 
of method of inference. On the one hand, if all the stochastic explanatory 
variables are correlated with the disturbance term, OLS does not usually 
provide consistent estimates of the structural coefficients in the equation 
and, even more generally, standard inference techniques (like F-tests) are 
not valid; one should use a simultaneous equations technique (e.g., instru-
mental variables). On the other hand, if they are all independent of the
disturbance term, standard linear regression techniques (OLS, F-tests) are
appropriate. Furthermore, between these two extremes, several intermediate
cases are possible. If some but not all stochastic explanatory variables are
independent of the disturbance term, standard inference techniques are not
generally valid. However, we can exploit this information to get a more effi-
cient method. In particular, if we split the matrix of stochastic explanatory
variables into two submatrices $Y = [Y_1 : Y_2]$, where $Y_2$ is independent of
$u$, we can get more efficient consistent estimators and more powerful tests
by treating $Y_2$ as exogenous: in particular, this can be done by using $Y_2$ as
an additional set of instruments or, at least, by not replacing $Y_2$ by $\hat Y_2$ (see
Maddala, 1977, pp. 477-478).

The procedures developed above allow one to test the exogeneity of each
stochastic explanatory variable included in a given equation by looking at
asymptotic t-values. It is easy to compute these in a routine way while
estimating the equation by an instrumental-variable method. In this manner,
one can get automatic indications on the simultaneity properties of a model
and possible ways of improving estimation efficiency.

Finally, we may observe that a number of economic hypotheses can be 
formulated in terms of the independence between certain stochastic explana-
tory variables and the error term in an equation. Wu (1973) described a 
number of such cases, such as the permanent income hypothesis, the ex- 
pected profit maximization hypothesis, and the recursiveness hypothesis in 
simultaneous equation models. 



6. PROOFS 



6.1 Proof of Theorem 1 

First, from (3.2), (3.3) and (3.8), we have the identity: 

Vt(«-«)=(^) 'er. (6.1) 

where 

er = ® - n)o. 

Moreover, we can see that 

$(\hat\Pi - \Pi)a = (Z'Z)^{-1}Z'Va$; 

hence $E[(\hat\Pi - \Pi)a] = 0$ and 

$E[(\hat\Pi - \Pi)aa'(\hat\Pi - \Pi)'] = \rho(Z'Z)^{-1}$, 

where $\rho = a'\Sigma_{22}a$, for, from Assumption 3, it is easily verified that 
$E[Vaa'V'] = \rho I_T$. Consequently, 

$\sqrt{T}(\hat\Pi - \Pi)a \sim N[0, \rho(Z'Z/T)^{-1}]$. (6.2) 



Also, since e is independent of Y, the distribution of -^X'e, conditional on 

y,isJV[o,aK^)]. 

Consider the characteristic function of ey, 



= S{exp[iV'er]} 

= ^;|exp ir'^X'e + ir' (^^^Vf{n-n)a J, 



where t G ^ _ ydj jn order to get an explicit expression for 

^r(r), we first compute the expected value of exp{iV'er} conditional on Y. 
Since is normal for given F, we have 



Rt(t) = £7[exp(iV'eT) | Y] - 



where 



and 



^ r ^(0 = exp 



v^(n - n)o 

Then, using (3.6) and (6.2), we see that 

plim = exp • 



Also, from (3.7), (6.2) and Assumption 5, we have 

(r)-^ exp{tV'B}, 







JEAN-MARIE DUFOUR 



where $B \sim N[0, \rho Q'_{zx}Q_z^{-1}Q_{zx}]$ and $\xrightarrow{d}$ refers to convergence in distribution 
as $T \to \infty$. Consequently, 

i?r(r)-^exp exp(tV'S); 

hence, by the Helly-Bray Theorem, 

lim E{Rt{t)} - exp £;{exp(«V'B)} 

T—^oo \ ^ / 

= exp {-^r' {alQ:, + pQ'>,Jiz^Qzx) rj , 

for all T. Since ^r(^) = it follows that ^r(^) converges to the 

characteristic function of the distribution. There- 

fore, 

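$e_T \xrightarrow{d} N[0, \sigma_e^2Q_x + \rho Q'_{zx}Q_z^{-1}Q_{zx}]$ (6.3) 

and, using (6.1), 

$\sqrt{T}(\hat\alpha - \alpha) \xrightarrow{d} N[0, \Sigma_\alpha]$, (6.4) 

where $\Sigma_\alpha$ is given by (3.9). The consistency of $\hat\alpha$ follows from (6.4). 
Concerning the estimator $\hat\sigma_e^2$, we can write 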

T T[y/Tj [ T J [y/rj’ 



hence 



plim dg = plim 



Moreover, by the definition of e* in (3.3), we have 

^ = Y + 2a'(n - n)'^ + a'(n - n)' (n - n)«; 



hence, since plim (Z'e/T) = 0 and plim (ft - II)a = 0, 
plim - plim ^ = < 7 ^, 

which shows that is a consistent estimator of . Finally, we can see that 
Eck is a consistent estimator of E« by considering the definitions of Qa; and 
Qzx) and by noting that p and are consistent for p and Q.E.D. 
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6.2 Proof of Corollary 1.1 

The consistency of a follows from the consistency of a. The asymptotic 
distribution of y/T{a — a) follows from the identity 

Vf{a-a) = C2(^-^X'e^^ (6.5) 

and from (6.3). The consistency of Ea follows from the consistency of E« 
and the definition of A 2 . Q.E.D. 



6.3 Lemma 

In order to obtain the asymptotic distribution of ^ = 1 ) 22 ^) we will need 
the following lemma. 



Lemma 1. Suppose that Assumptions 2, 3 and 5 are satisfied. Let and 
be the tth rows of S 22 and E 22 , respectively (1 = 1, . . and 

Then, the vector \/r(a — <r) has a normal limiting distribution, as T — > oo, 
with mean zero and covariance matrix 



rv^, 






11 



Lvoi 



Vgg J 



( 6 . 6 ) 



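where $V_{ij} = \sigma_{ij}\Sigma_{22} + \sigma_j\sigma_i'$. Furthermore, the vector $\sqrt{T}(\hat\Sigma_{22} - \Sigma_{22})c$, where 
$c$ is any fixed $G \times 1$ vector, has a normal limiting distribution with mean 
zero and covariance matrix 

$\psi_c = (c'\Sigma_{22}c)\Sigma_{22} + (\Sigma_{22}c)(\Sigma_{22}c)'$. (6.7) 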



Proof. Let Wi = M^Wi, where Wi is the ith column of the matrix V and 
Mg = Ix - Z{Z'Z)~^Z‘ . The (i,i)th element of S 22 has the form = 
w^-Wj/T] hence 

. ^ w(w,- 1 f Z'wi fz'zy^ f Z'wj \ 

T T\^) \ T ) \^ )■ 

Let Uij = w'iWj/Tyffi = (of<i,5’i2, --,^iG)', 3 = and a = 

(a( , • • • ) ^a)'- Then, using Assumptions 3 and 5, we get 

plim [VT {oij - ffij) - Vt [ffij - <7iy)| =0, t , y = 1, . . . , G. 
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Thus, the vectors \/T{a - a) and y/T{a — a) have the same limiting distri- 
bution. 

Let Wit be the tth element of Wi and define Sa = {wawu, WitW 2 t, • • • , 
WGtYy i = St = . . . , t = 1, . . . , T. R is clear that 

the vectors St, t = 1, . . . , T, are independent and identically distributed with 
mean a. Furthermore, since 

E [(Sit - (Ti) (Syt - <Tj)'] = [(rij(Tkt + k, £=i a 

= CTij^22 + (rjtr'i, *,j = 1, . . . , G, 

for all t (see Anderson, 1958, p. 39), the covariance matrix of St is as 
given in (6.6). Thus, since 

Vr(cr -a) = ^ ^(St - <t), 

and using the Multivariate Central Limit Theorem (see Anderson, 1958, 
Theorem 4.2.3), we can conclude that the limiting distribution of y/T{a — a) 
is A[0,E^y]. Furthermore, for any C X 1 fixed vector c, 

\/T(E 22 - ^ 22 )c = {Ig ® c') Vf {a -a), 

where Ig is the identity matrix of order G and 0 refers to the Kronecker 
product. Since VT{a — a) is asymptotically iV[0,S^], we can conclude that 
the vector >/T (^22 ~ ^ 22 )^ is asymptotically normal with mean zero and 
covariance matrix 

V^c = (f^G 0 c^) {Ig 0 c) 

= [c'Vijc]^^ y=i,...,G • 

We see easily that tpc reduces to the expression in (6.7). Q.E.D. 
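
The covariance formula in (6.7) can be checked by simulation. The sketch below rests on several assumptions that are not in the paper: a $2 \times 2$ matrix standing in for $\Sigma_{22}$, a fixed vector $c = (1, -1)'$, and the use of the true disturbances in place of regression residuals (the proof above argues that the two versions share the same limiting distribution). For these illustrative values, $\psi_c$ has diagonal elements 2.25 and 6.25 and off-diagonal element 0.25.

```python
import math
import random

SIGMA = [[1.0, 0.5], [0.5, 2.0]]   # illustrative stand-in for Sigma_22 (assumption)
C_VEC = [1.0, -1.0]                # illustrative fixed vector c (assumption)

def psi_c(sigma, c):
    """(c' Sigma c) Sigma + (Sigma c)(Sigma c)', the matrix in (6.7)."""
    g = len(c)
    sc = [sum(sigma[i][k] * c[k] for k in range(g)) for i in range(g)]
    csc = sum(c[i] * sc[i] for i in range(g))
    return [[csc * sigma[i][j] + sc[i] * sc[j] for j in range(g)]
            for i in range(g)]

def simulated_cov(T=50, reps=4000, seed=1):
    """Second-moment matrix of sqrt(T) (S - Sigma) c across replications,
    where S = (1/T) sum_t w_t w_t' uses the true disturbances w_t."""
    rng = random.Random(seed)
    l11 = math.sqrt(SIGMA[0][0])               # Cholesky factor of SIGMA
    l21 = SIGMA[1][0] / l11
    l22 = math.sqrt(SIGMA[1][1] - l21 ** 2)
    sc = [sum(SIGMA[i][k] * C_VEC[k] for k in range(2)) for i in range(2)]
    draws = []
    for _rep in range(reps):
        acc = [0.0, 0.0]
        for _t in range(T):
            e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
            w = (l11 * e1, l21 * e1 + l22 * e2)      # w_t ~ N(0, SIGMA)
            wc = w[0] * C_VEC[0] + w[1] * C_VEC[1]   # scalar w_t' c
            acc[0] += w[0] * wc
            acc[1] += w[1] * wc
        draws.append([math.sqrt(T) * (acc[i] / T - sc[i]) for i in range(2)])
    return [[sum(d[i] * d[j] for d in draws) / reps for j in range(2)]
            for i in range(2)]

print(psi_c(SIGMA, C_VEC))
```

Calling `simulated_cov()` returns an empirical matrix that should lie close to `psi_c(SIGMA, C_VEC)`, up to Monte Carlo error.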



6.4 Proof of Theorem 2 

First, note that the vector y/T{6 — ^) can be decomposed as follows: 
y/r{6 — 6) = \/TE22(fl “ a) + \/T(S22 ~ ^ 22 )^ 

= E22G2 + (^) v^(n - n)a 

-|- y/r ^1)22 “ ^ 22 ^ 

where we have used (6.5) and (3.3). Let 

Wt = E 22 A 2 ^A'e-f-g;,^\/T(n-n)a + (S22 - S22) a. 
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Since plim(S 22 C' 2 ) = E 22 A 2 and plim(X'Z/T) = we have plim[>/T(5- 
6)—Wt] = 0; y/T{6 — 8) and Wt must have the same asymptotic distribution 
(see Billingsley, 1968, p. 25, Theorem 4.1). Consider now the characteristic 
function of Wt , 

<t>T{T) = £?{exp[iV'Wr]}, 

where t G . Since e is independent of V, -^X^e ~ ^[0,<r*(^)]fory 
fixed and, by taking the expected value of exp(«VWj’) conditional on Y, we 
get 

St{t) = £;{exp(.V'iyr) | Y} = 

where 

^(r) = exp |-i<7*r'S22A2 ^ 2 S 22 r| 

and 

= exp {ir'S 22 A 2 Q;*\/T(n - H)a + iV'\/r (±22 - S 22 ) a} • 
Furthermore, using (3.6), 

plim = exp |-i<r^J-'E22A2<5*A'2S22r| = (6.8) 

Consequently, by the Helly-Bray Theorem, 

S[Sr(r)l = E [5(?^r)] , (6.9) 



where the expectation E is taken over Y. 

Each column fty of ft is independent of each column it)*, of V, since 
^;[(ny-n,) it)*.] = 0, y , A; = 1, . . . , G. Therefore, ft and II 22 are independent 
and 



E { 4 *^’-)} = E {exp [»V'E22A2Qi,.VT(n - n)a] } 

X E {exp ^tV'\/r {e 22 - S22) aj { 

1 / Z' z\ 

~2^^^^22^2Q«x ( ) QzxA2^22'f 



= exp 



X E jexp [^iV'\/T ^E22 - S22) a] | , 
where the second identity comes from (6.2). By Lemma 1, 

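$\sqrt{T}(\hat\Sigma_{22} - \Sigma_{22})a \xrightarrow{d} N[0, \psi]$, (6.10) 

where $\psi = (a'\Sigma_{22}a)\Sigma_{22} + (\Sigma_{22}a)(\Sigma_{22}a)'$. Since $\rho = a'\Sigma_{22}a$ and $\delta = \Sigma_{22}a$, 
we see that $\psi = \rho\Sigma_{22} + \delta\delta'$. Thus, the limit characteristic function of 
$\sqrt{T}(\hat\Sigma_{22} - \Sigma_{22})a$ is 

$\lim_{T\to\infty} E\{\exp[i\tau'\sqrt{T}(\hat\Sigma_{22} - \Sigma_{22})a]\} = \exp\{-\tfrac{1}{2}\tau'\psi\tau\}$. (6.11) 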

Using (6.8)-(6.11), we obtain 

lim £?[5r(r)] 

T— ►©© 

= exp |~2^^ [^22>l2 {(^eQx + PQzxQz^Qzx) >^ 2^22 + V^] ^ | > 

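which implies that the asymptotic distribution of $\sqrt{T}(\hat\delta - \delta)$ is normal with 
mean zero and covariance matrix $\Sigma_\delta = \Sigma_{22}\Sigma_a\Sigma_{22} + \psi$, where $\Sigma_a$ is given 
by (3.14). Finally, the consistency of $\hat\delta = \hat\Sigma_{22}\hat{a}$ follows from that of $\hat\Sigma_{22}$ 
and $\hat{a}$ for $\Sigma_{22}$ and $a$ respectively, and the consistency of $\hat\Sigma_\delta$ follows from the 
consistency of $\hat\Sigma_{22}$, $\hat\Sigma_a$ and $\hat\psi$ for $\Sigma_{22}$, $\Sigma_a$ and $\psi$ respectively. Q.E.D. 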

6.5 Asymptotic Power 

We will now show that the tests discussed above are consistent. The 
statistic $S(M, m_0)$ used to test $M\alpha = m_0$, where $M$ is a $\nu \times L$ matrix of 
rank $\nu$, can be decomposed in the following way: 

$S(M, m_0) = S_1(M, \alpha) + \sqrt{T}S_2(M, \alpha, m_0)$, 

where 

$S_1(M, \alpha) = T(M\hat\alpha - M\alpha)'(M\hat\Sigma_\alpha M')^{-1}(M\hat\alpha - M\alpha)$ 

converges to a chi-square distribution with $\nu$ degrees of freedom and 

$S_2(M, \alpha, m_0) = [2\sqrt{T}M(\hat\alpha - \alpha) + \sqrt{T}(M\alpha - m_0)]'(M\hat\Sigma_\alpha M')^{-1}(M\alpha - m_0)$. 

We will show that plim $S_2(M, \alpha, m_0) = +\infty$ whenever $M\alpha \ne m_0$. 

Consider first the case where all the elements of the vector $M\alpha - m_0$ 
are different from zero. In the sum $[2\sqrt{T}M(\hat\alpha - \alpha) + \sqrt{T}(M\alpha - m_0)]$, the 
second term always dominates as $T \to \infty$ and $\sqrt{T}(\hat\alpha - \alpha)$ has a limiting 
distribution. Consequently 

plim $S_2(M, \alpha, m_0)$ = plim $\sqrt{T}(M\alpha - m_0)'(M\hat\Sigma_\alpha M')^{-1}(M\alpha - m_0) = +\infty$, 

where the fact that plim $(M\hat\Sigma_\alpha M')^{-1} = (M\Sigma_\alpha M')^{-1}$ is positive definite 
has been used. Second, for the case where $M\alpha \ne m_0$ but some elements of 
$(M\alpha - m_0)$ are zero, we can assume without loss of generality that these 
constitute the lower subvector of $(M\alpha - m_0)$: 

$M\alpha - m_0 = (d_1', 0')'$, 

where all the elements of the $\nu_1 \times 1$ vector $d_1$ are different from zero. 
Furthermore, let us partition $(\hat\alpha - \alpha)$ and $(M\hat\Sigma_\alpha M')^{-1}$ conformably with $(d_1', 0')'$: 

$(\hat\alpha - \alpha) = \begin{bmatrix} (\hat\alpha - \alpha)_1 \\ (\hat\alpha - \alpha)_2 \end{bmatrix}$, $(M\hat\Sigma_\alpha M')^{-1} = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}$, 

where $(\hat\alpha - \alpha)_1$ is a $\nu_1 \times 1$ vector and $A_{11}$ is a $\nu_1 \times \nu_1$ positive definite matrix. 
Then 

$S_2(M, \alpha, m_0) = 2\sqrt{T}(\hat\alpha - \alpha)_1'M'A_{11}d_1 + 2\sqrt{T}(\hat\alpha - \alpha)_2'M'A_{21}d_1 + \sqrt{T}d_1'A_{11}d_1$. 

Since plim $(A_{11})$ is a positive definite matrix and $\sqrt{T}(\hat\alpha - \alpha)$ has a limiting 
distribution, we have 

plim $S_2(M, \alpha, m_0) = +\infty$. (6.12) 

Thus, (6.12) holds whenever $M\alpha \ne m_0$, 

plim $S(M, m_0)$ = plim $[S_1(M, \alpha) + \sqrt{T}S_2(M, \alpha, m_0)] = +\infty$, 

whenever $M\alpha \ne m_0$, and 

$\lim_{T\to\infty} P[S(M, m_0) > c] = \varepsilon$ if $M\alpha = m_0$, and $= 1$ if $M\alpha \ne m_0$, (6.13) 

where $\varepsilon$ is the level of the test. This proves the consistency of the tests 
proposed for linear hypotheses regarding $\alpha$. The consistency of tests regarding 
$\delta$ can be shown in a similar way. 
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THE FINITE SAMPLE MOMENTS OF OLS 
IN DYNAMIC MODELS WHEN DISTURBANCES 
ARE SMALL 

1. INTRODUCTION 

For many years, econometric researchers have been using Least Squares 
methods (LS) to estimate the coefficients in dynamic single equation econo- 
metric models. These methods are usually justified by citing research that 
proves LS estimators have desirable asymptotic properties when the errors 
are assumed to be independent and identically distributed. Seminal articles 
by Mann and Wald (1943), White (1958) and Anderson (1959) are among 
those cited. These articles prove that under the assumption of independent 
and identically distributed errors, LS is asymptotically unbiased, asymptot- 
ically efficient and consistent in the context of dynamic econometric models. 
However, in small samples typical of econometric research involving time 
series data, the distribution of the LS estimator in dynamic models is much 
less certain and is difficult to obtain. As a result, researchers who ignore 
these finite sample problems may sacrifice the accuracy of their conclusions. 

In this paper, I shall present both theoretical and numerical results on 
the bias and the Mean Squared Error (MSE) of the LS estimator of the au- 
toregressive coefficient in a first order stochastic difference equation with 
white noise normal errors. The theoretical results are divided into two 
groups; Exact Results and Small Disturbance Approximate Results. The 
exact formulae rely heavily on the assumption of a normal error structure 
while the approximate formulae only require the existence of the first four 
moments of the error distribution. Both sets of results are dependent on 
the characteristics of the model and hence are functions of all the unknown 
parameters and the exogenous data series that exist in the model. Conse- 
quently, these formulae are very general and are valid for any specification 
of the model parameters. However, in this general form, they are unable to 
provide any quantitative information about the sign direction and/or size of 
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the bias and MSE of the LS estimator. Thus, numerical evaluation of these 
formulae is necessary. 

Through the use of computers, the theoretical formulae are evaluated 
under alternative specifications for the unknown parameters and exogenous 
data. These results are presented in this paper and comparisons are made, 
not only among alternative parameter scenarios but also between the approx- 
imate results and the exact results. Although the numerical results are not 
exhaustive, they do provide some limited, yet important, information about 
the nature of the distribution of the LS estimator in a stochastic difference 
equation. 

The plan of this paper is as follows: Section 2 outlines the model and the 
estimators considered in this paper; Sections 3 and 4 present the theoretical 
and numerical results respectively; while Section 5 summarizes the general 
findings. 



2. MODEL AND ESTIMATOR 

In this paper, I consider a first-order stochastic difference equation with 
normally distributed, white noise errors. Specifically, 

$y_t = \mu + \gamma y_{t-1} + x_t\beta + u_t$, (2.1) 

where $y_t$ is the $t$th observation on the dependent variable. The parameter $\mu$ 
is an intercept that may or may not be included with $x_t$, which is a $(1 \times K)$ 
vector of observations on $K$ exogenous variables at time $t$. The $u_t$ is a 
random error term at time $t$ with distribution $N(0, \sigma^2)$. The parameter $\gamma$ is 
the scalar autoregressive coefficient on the lagged dependent variable and $\beta$ 
is a $(K \times 1)$ vector of coefficients corresponding to the $K$ exogenous variables 
in the model. 

Following the example of Evans and Savin (1984) it is convenient to use 
a transformed version of (2.1). Subtraction of the initial value, $y_0$, from both 
sides of (2.1) results in the following model: 

$z_t = \gamma z_{t-1} + \delta + x_t\beta + \sigma w_t$, (2.2) 

where $z_{t-1} = (y_{t-1} - y_0)$, so that $z_0 = 0$, $u_t/\sigma = w_t \sim N(0, 1)$, 
$\delta = y_0(\gamma - 1) + \mu$, and $y_0 = c > 0$. This transformed version of model (2.1) is 
used because it allows us to write model (2.1) in matrix notation as follows: 



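$z = \gamma Lz + \bar{X}\bar\beta + \sigma w$, (2.3) 

where 

$Lz = z_{-1}$, $L_{T \times T} = \begin{bmatrix} 0' & 0 \\ I_{T-1} & 0 \end{bmatrix}$, $\bar{X} = [\iota : X]$ and $\bar\beta = (\delta, \beta')'$, 

with $\iota$ a $T \times 1$ vector of ones. Rearrangement of (2.3) reveals that 

$z = H\bar{X}\bar\beta + \sigma Hw$, (2.4) 

where $H = (I - \gamma L)^{-1}$, and $\bar{X}$ is $(T \times (K + 1))$. 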

We have decomposed model (2.1) into its stochastic and non-stochastic 
components without making any assumptions about the stability of the 
autoregressive portion of the model. That is, as long as $(I - \gamma L)$ is invertible, 
the representation of model (2.1) in (2.4) is valid. 
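
The equivalence between the recursion (2.2) and the matrix form (2.4) can be verified numerically. The sketch below is illustrative only: it assumes a single exogenous regressor ($K = 1$), and the function names and parameter values are hypothetical. It uses the fact that, for the $T \times T$ lag matrix $L$, the matrix $(I - \gamma L)$ is unit lower triangular, so its inverse $H$ has $(i, j)$ element $\gamma^{i-j}$ for $i \ge j$ and zero otherwise; this holds for any value of $\gamma$, including values above unity.

```python
import random

def lag_inverse(gamma, T):
    """H = (I - gamma*L)^{-1}: lower triangular with H[i][j] = gamma**(i - j)."""
    return [[gamma ** (i - j) if i >= j else 0.0 for j in range(T)]
            for i in range(T)]

def z_by_recursion(gamma, delta, beta, sigma, x, w):
    """z_t = gamma*z_{t-1} + delta + x_t*beta + sigma*w_t, starting from z_0 = 0."""
    z, prev = [], 0.0
    for xt, wt in zip(x, w):
        prev = gamma * prev + delta + xt * beta + sigma * wt
        z.append(prev)
    return z

def z_by_matrix(gamma, delta, beta, sigma, x, w):
    """z = H (delta*iota + x*beta + sigma*w), the matrix form in (2.4)."""
    T = len(x)
    H = lag_inverse(gamma, T)
    drive = [delta + xt * beta + sigma * wt for xt, wt in zip(x, w)]
    return [sum(H[i][j] * drive[j] for j in range(T)) for i in range(T)]

def y_by_recursion(gamma, mu, beta, sigma, x, w, y0):
    """Original model (2.1), returning y_1, ..., y_T."""
    y, prev = [], y0
    for xt, wt in zip(x, w):
        prev = mu + gamma * prev + xt * beta + sigma * wt
        y.append(prev)
    return y

# Hypothetical settings: gamma above unity, y0 = 1, no intercept in (2.1).
rng = random.Random(0)
gamma, mu, beta, sigma, y0, T = 1.025, 0.0, 1.0, 1.0, 1.0, 10
delta = y0 * (gamma - 1.0) + mu
x = [0.1 * t for t in range(1, T + 1)]
w = [rng.gauss(0.0, 1.0) for _ in range(T)]
gap = max(abs(a - b) for a, b in zip(z_by_recursion(gamma, delta, beta, sigma, x, w),
                                     z_by_matrix(gamma, delta, beta, sigma, x, w)))
print(f"largest discrepancy between (2.2) and (2.4): {gap:.2e}")
```

The helper `y_by_recursion` can be used in the same way to confirm that $y_t - y_0$ reproduces $z_t$.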

In Section 4 it will be necessary to know the general form of the distri- 
bution for y and/or z. Thus, using the decomposed form of z in (2.4), the 
mean of z is seen to be 

$\bar{z} = E(z) = H\bar{X}\bar\beta$, (2.5) 

and the variance-covariance matrix is 

$V(z) = E[(z - E(z))(z - E(z))']$ 

$= E[(H\bar{X}\bar\beta + \sigma Hw - H\bar{X}\bar\beta)(H\bar{X}\bar\beta + \sigma Hw - H\bar{X}\bar\beta)']$ 

$= E[(\sigma Hw)(\sigma Hw)']$ 

$= E(\sigma^2Hww'H')$ 

$= \sigma^2HH' = \Omega$. (2.6) 

Therefore, due to the normality of $w$ and to expressions (2.5) and (2.6), the 
distribution of $z$ is seen to be 

$N(H\bar{X}\bar\beta, \Omega)$. (2.7) 

Having completely specified the model and derived the distribution of the 
dependent variable, we can now focus our attention on the form of the least 
squares estimator of $\gamma$. 

The least squares estimator of $\gamma$ is derived by minimizing the sum of the 
squared errors with respect to the true parameters. Thus, the LS estimator 
for $\gamma$ is 

_ _ i'-iMz _ y'-iMy 

^'Bz y'-iMy-i’ 

where 

M = I-X{X'X)-^X\ 

M = I-X{X'X)-^X\ 

A = VM, 

B = L'ML, 

and Zy z-iy y and y_i are defined in (2.1) and (2.4). M and M are idempo- 
tent matrices of rank IT -h 1. It is obvious that if yo = 0 then M = M with 
rank K 1 < T and z = y and z-i = y_i. Finally, it can be noted that B 
is idempotent of rank K and A is nonsymmetric. 

The model as specified in (2.3)-(2.8) will be used to derive the small 
disturbance asymptotic results in Section 3. However, when deriving the 
exact results, it will be much more convenient to use the canonical form 
of model (2.3) and estimator (2.8). Thus the following adjustments to the 
notation are necessary. Notice that $B = L'ML$ is idempotent of rank $K$. 
Thus, there exists a $T \times T$ orthogonal matrix $P$ such that 

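$PBP' = I^* = \begin{bmatrix} I_n & 0 \\ 0 & 0 \end{bmatrix}$, 

where $I_n$ is an $n \times n$ identity matrix and $n = T - K$. If one uses $P$ to 
transform model (2.4) one obtains 

$Pz = s^* = PH\bar{X}\bar\beta + \sigma PHw$, (2.9) 

where $s^* \sim N(\bar{s}^*, \Omega^*)$, $\Omega^* = P\Omega P'$, and $\bar{s}^* = PH\bar{X}\bar\beta$. Thus, the LS 
estimator in (2.8) becomes 

$\hat\gamma = \dfrac{z'\bar{A}z}{z'Bz} = \dfrac{s^{*\prime}A^*s^*}{s^{*\prime}I^*s^*}$, (2.10) 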
where $A^* = P\bar{A}P'$, $\bar{A} = (A + A')/2$, $P'P = I_T$, and $\bar{A}$ is now symmetric. 
$\bar{A}$ has been introduced for convenience since $z'\bar{A}z = z'Az$ (for examples of 
this usage, see Carter and Ullah, 1979; Hoque, 1980). This form of the 
estimator is easier to work with when deriving exact results. Furthermore, 
we can make these derivations even easier by standardizing $z$ in the following 
manner. Recall that the distribution of $s^*$ in (2.9) is 





$s^* \sim N(\bar{s}^*, \Omega^*)$. (2.11) 

Thus $s^*$ can be standardized as follows: 








(2.12) 


where 






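and $\Omega^{*-1/2}$ is a standardizing $T \times T$ transformation derived from the 
variance-covariance matrix of $s^*$. This transformation is derived by an 
eigenvalue-eigenvector decomposition of $\Omega^*$. That is, there exists a $T \times T$ 
orthogonal matrix $R$ such that 

$R\Omega^*R' = \Lambda_{T \times T} = \mathrm{diag}\{\lambda_i;\ i = 1, \ldots, T\}$ (2.13) 

and $R'R = I_T$. Therefore, 

$R'\Lambda R = \Omega^*$. 

Hence, if 

$R'\Lambda RR'\Lambda R = R'\Lambda^2R$, 

where 

$\Lambda^2 = \mathrm{diag}\{\lambda_i^2;\ i = 1, \ldots, T\}$, 

then 

$R'\Lambda^{j}R = \Omega^{*j}$, $j \in Z_+$. (2.14) 

This allows us to write 

$\Omega^{*-1/2} = R'\Lambda^{-1/2}R$, (2.15) 

where 

$\Lambda^{-1/2} = \mathrm{diag}\{\lambda_i^{-1/2};\ i = 1, \ldots, T\}$. 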

This allows us to write the estimator given in (2.10) as follows: 

$\hat\gamma = \dfrac{s'Cs}{s'Ds}$, (2.16) 

where 

$s \sim N(\bar{s}, I_T)$, 

$\bar{s} = R\Omega^{*-1/2}PH\bar{X}\bar\beta$, 

$C = R\Omega^{*1/2}A^*\Omega^{*1/2}R'$, 

and 

$D = \begin{bmatrix} \Lambda_n & 0 \\ 0 & 0 \end{bmatrix}$; 

$D = \Lambda$ when the last $K$ diagonal elements are equal to zero. 

The model and estimator expressed in (2.16) are equivalent to the form 
in (2.3) and (2.10). However, the alternative forms are required because they 
greatly simplify the theoretical derivations in Section 3 and the programming 
of the computer algorithm for the numerical results in Section 4. 

3. THEORETICAL RESULTS 

The results presented in the following subsections are derived using ex- 
act analytical techniques and small disturbance asymptotic techniques. The 
exact results are based on the work of Sawa (1972, 1978), Phillips (1980) and 
Hoque (1980). The small disturbance approximate results and the asymp- 
totic results are based on work by Kadane (1970, 1971), Carter and Ullah 
(1979) and Evans and Savin (1984). These articles have been instrumental 
in developing the analytical tools needed to solve the technical problems in 
this section. 

3.1 Exact Results 

We write (2.16) as follows: 

$s = \bar{s} + \eta$, (3.1.1) 

where 

$\eta = \sigma R\Omega^{*-1/2}PHw \sim N(0, I)$, 
$s \sim N(\bar{s}, I_T)$; 

the LS estimator for $\gamma$ is 

$\hat\gamma = \dfrac{s'Cs}{s'Ds}$, (3.1.2) 

where $D = \mathrm{diag}\{d_i;\ i = 1, \ldots, T\}$, $d_i$ is non-zero for $i = 1, \ldots, T - h$, and 
is zero otherwise, $h = K$, and $T - h$ is the degrees of freedom. 
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The distribution of 8 and the formula for 7^ can be used to derive the first 
and second moments^ of 7. (See “Endnotes” section after the Conclusions.) 

The following theorems, without proof®, contain the formulae for these 
exact moments. 

Theorem 3.1. The exact mean of the Least Squares estimator of the 
autoregressive coefficient in a first-order stochastic difference equation with white 
noise, normal errors is 

$E(\hat\gamma) = \int_0^\infty \exp\{\tfrac{1}{2}\bar{s}'Q^*\bar{s}\}\,|I + 2tD|^{-1/2}\,\{\bar{s}'QCQ\bar{s} + \mathrm{tr}(QC)\}\,dt$, 

where 

$Q^* = Q - I$ 

and 

$Q = (I + 2tD)^{-1} = \mathrm{diag}\{1/(1 + 2td_i);\ i = 1, \ldots, T\}$. 

Theorem 3.2. The exact second moment of the Least Squares estimator of 
the autoregressive coefficient for a first-order stochastic difference equation 
with white noise, normal errors is 

$E(\hat\gamma^2) = \int_0^\infty \exp\{\tfrac{1}{2}\bar{s}'Q^*\bar{s}\}\,|I + 2tD|^{-1/2}$ 

$\times \{(\bar{s}'QCQ\bar{s} + \mathrm{tr}\,QC)^2 + 2\,\mathrm{tr}\,QCQC' + 4\bar{s}'QC'QCQ\bar{s}\}\,t\,dt$, 

where 

$Q^* = Q - I$ 

and 

$Q = (I + 2tD)^{-1} = \mathrm{diag}\{1/(1 + 2td_i);\ i = 1, \ldots, T\}$. 
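
An integral of this kind can be evaluated by one-dimensional quadrature and cross-checked by simulation. The sketch below is not the author's algorithm: it assumes the representation $E(\hat\gamma) = \int_0^\infty \exp\{\tfrac{1}{2}\bar{s}'(Q - I)\bar{s}\}\,|I + 2tD|^{-1/2}\{\bar{s}'QCQ\bar{s} + \mathrm{tr}(QC)\}\,dt$ for the ratio $s'Cs/s'Ds$ with $s \sim N(\bar{s}, I)$, a diagonal $D$ and a symmetric $C$. The midpoint rule, the change of variable $t = u/(1 - u)$, and the small $2 \times 2$ test matrices are illustrative choices. With $\bar{s} = 0$, $D = I_2$ and $C = \mathrm{diag}(1, 0)$ the ratio is $s_1^2/(s_1^2 + s_2^2)$, whose mean is exactly 1/2.

```python
import math
import random

def exact_mean(C, d, s_bar, n_grid=20000):
    """Midpoint-rule evaluation of the integral for E[s'Cs / s'Ds], D = diag(d)."""
    T = len(d)
    total = 0.0
    for k in range(n_grid):
        u = (k + 0.5) / n_grid            # u in (0, 1)
        t = u / (1.0 - u)                 # maps (0, 1) onto (0, infinity)
        jac = 1.0 / (1.0 - u) ** 2        # dt/du
        q = [1.0 / (1.0 + 2.0 * t * di) for di in d]   # diagonal of Q
        det_term = math.sqrt(math.prod(q))             # |I + 2tD|^{-1/2}
        expo = math.exp(0.5 * sum((qi - 1.0) * si * si
                                  for qi, si in zip(q, s_bar)))
        qs = [qi * si for qi, si in zip(q, s_bar)]     # Q s_bar
        quad = sum(qs[i] * C[i][j] * qs[j] for i in range(T) for j in range(T))
        trace = sum(q[i] * C[i][i] for i in range(T))  # tr(QC)
        total += expo * det_term * (quad + trace) * jac
    return total / n_grid

def monte_carlo_mean(C, d, s_bar, reps=100000, seed=0):
    """Direct simulation of E[s'Cs / s'Ds] with s ~ N(s_bar, I)."""
    rng = random.Random(seed)
    T = len(d)
    acc = 0.0
    for _ in range(reps):
        s = [m + rng.gauss(0.0, 1.0) for m in s_bar]
        num = sum(s[i] * C[i][j] * s[j] for i in range(T) for j in range(T))
        den = sum(d[i] * s[i] * s[i] for i in range(T))
        acc += num / den
    return acc / reps

C_demo = [[1.0, 0.0], [0.0, 0.0]]   # illustrative symmetric C (assumption)
d_demo = [1.0, 1.0]                 # illustrative diagonal of D (assumption)
print(exact_mean(C_demo, d_demo, [0.0, 0.0]))
```

For a nonzero mean vector such as `[1.0, 0.5]`, `exact_mean` and `monte_carlo_mean` should agree up to simulation error.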



3.2 Small Disturbance Approximate Results 

The purpose of this subsection is to present formulae for the first and 
second moments of $\hat\gamma$ in (2.10) that are approximations to the exact formulae 
in subsection (3.1). The analytical technique used is the small disturbance 
asymptotic approximation popularized by Kadane. 

The following theorems summarize the small disturbance approximate 
bias and mean squared error for $\hat\gamma$ in (2.10). 
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Theorem 3.3. The small disturbance approximate bias of the Least Squares 
estimator of the autoregressive coefficient in a first-order stochastic difference 
equation with white noise, normal errors up to 0{a^) in probability is 



Eaift - 7 ) 




where z, A, B and H are defined in Section 2.'* 



Theorem 3.4. The small disturbance approximate Mean Squared Error of 
the Least Squares estimator of the autoregressive coefficient in a first- order 
stochastic difference equation with white noise, normal errors up to 0{a^) 
in probability is given by 

- 7)^ = 



where 



G = 4{z'Bz)-^ [tBHH'Bz + 6{^Bz)-'^{z'BHA'zy] 



tBHA'z • tiH'A + -s' AH' AH' Bz + -z'AH'BHA'z 

jL 



- S{z'Bz)~^ 

+ {tTH'Ay + tiH'AHfA - tiH'BH, 



and where H, A, B, z are defined in Section 2. 

The exact results in this section present general formulae for the first 
and second-order moments (hence bias and MSE) of the distribution of the 
LS estimator of $\gamma$ in model (2.1). They are general in the sense that they are 
valid for any prior setting of the unknown nuisance parameters $y_0$, $\sigma$, and 
$X$ matrix as well as the true parameters $\gamma$ and $\beta$. However, the formulae 
are not general in the sense that we have assumed normality of the error 
distribution which is crucial for the exact results. 

The small disturbance results are also general, in the sense that they are 
valid for all combinations of $y_0$, $\gamma$, $\sigma$ and $X$ matrix that will render the 
small disturbance expansion valid. ^ This will inevitably restrict the admis- 
sible parameter space and limit the broad generality of the approximation. 
The consequences of violating the admissibility condition show up vividly in 
the numerical calculations section. 



4. NUMERICAL RESULTS 



The purpose of this section is to evaluate the formulae developed in 
the previous section under alternative scenarios for the unknown nuisance 
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parameters and the exogenous data series. Based on these alternative prior 
beliefs about the location in the parameter space, comparative tables are 
presented featuring numerical evaluations of the general moment formulae 
with emphasis on the sign and size of biases, on the size of the MSE, and 
on the performance of the approximation formulae compared to the exact 
results. 

The choice of the parameter space was governed by the availability of 
computing resources.® Three types of data for the X matrix were used, each 
being used as an exclusive case and not together. Thus, $X$ is a $T \times 2$ matrix 
or $K = 1$. This is done to see the effect of alternative types of data. The 
choices for $x_t$ are: 

1) TRENDING DATA 

2) NONTRENDING DATA 

Xt is randomly drawn from a uniform distribution in the interval [0, 1]. 

3) AR(2) DATA 

$x_t = \rho_1x_{t-1} + \rho_2x_{t-2} + \varepsilon_t$, 

where $\rho_1 = .75$, $\rho_2 = -.5$, $\varepsilon_t \sim N(0, 1)$, and $x_t$ follows a stationary 

AR(2) process.^ 
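
The three kinds of exogenous series can be generated along the following lines. This is a sketch under stated assumptions rather than the procedure used for the tables: the trending series is taken to be $x_t = \exp\{.1t\}$ for $t = 1, \ldots, T$ (as read from the heading of Table 1), and the start-up of the AR(2) series (zero initial values followed by a discarded burn-in) is a hypothetical choice, as are the function names. The helper `is_stationary_ar2` applies the standard stationarity conditions for an AR(2) process, which the values $\rho_1 = .75$, $\rho_2 = -.5$ satisfy.

```python
import math
import random

def trending(T):
    """x_t = exp(0.1 t), t = 1, ..., T (assumed reading of Table 1's heading)."""
    return [math.exp(0.1 * t) for t in range(1, T + 1)]

def nontrending(T, rng):
    """x_t drawn independently from a uniform distribution on [0, 1)."""
    return [rng.random() for _ in range(T)]

def ar2(T, rng, rho1=0.75, rho2=-0.5, burn_in=200):
    """x_t = rho1*x_{t-1} + rho2*x_{t-2} + e_t with e_t ~ N(0, 1).
    Zero start-up values and the burn-in length are assumptions."""
    x1 = x2 = 0.0
    out = []
    for t in range(burn_in + T):
        x = rho1 * x1 + rho2 * x2 + rng.gauss(0.0, 1.0)
        x2, x1 = x1, x
        if t >= burn_in:
            out.append(x)
    return out

def is_stationary_ar2(rho1, rho2):
    """Standard AR(2) stationarity triangle."""
    return (rho1 + rho2 < 1.0) and (rho2 - rho1 < 1.0) and (abs(rho2) < 1.0)

rng = random.Random(0)
print(is_stationary_ar2(0.75, -0.5))
print(len(trending(10)), len(nontrending(10, rng)), len(ar2(10, rng)))
```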

The rest of the parameters of the model also will take on various values. 
In all tables, $\gamma = [.4, .8, 1.0, 1.01, 1.025]$, $\sigma^2 = [.5, 1.0]$, $y_0 = 1.0$, $T = 10$, 
$\mu = 0$ or $1.0$. In one table the effect of an increasing sample size is observed 
by setting $T = 20$. The specification of the parameter space is far from 
exhaustive; however, from the analysis of this limited number of scenarios 
some interesting observations can be made. 
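
A rough Monte Carlo check of the finite-sample bias for settings of this kind might look as follows. The sketch is deliberately simplified and is not the algorithm behind the tables: it assumes a pure first-order autoregression with an estimated intercept and no exogenous regressor, so its numbers are not comparable with the tabulated values. The function names, the number of replications and the seed are hypothetical.

```python
import random

def ls_gamma(y):
    """LS slope from regressing y_t on an intercept and y_{t-1}."""
    lag, cur = y[:-1], y[1:]
    n = len(cur)
    mean_lag, mean_cur = sum(lag) / n, sum(cur) / n
    sxy = sum((a - mean_lag) * (b - mean_cur) for a, b in zip(lag, cur))
    sxx = sum((a - mean_lag) ** 2 for a in lag)
    return sxy / sxx

def mc_bias(gamma, sigma2, T, y0=1.0, mu=0.0, reps=4000, seed=0):
    """Average of (gamma_hat - gamma) over simulated samples of length T."""
    rng = random.Random(seed)
    sd = sigma2 ** 0.5
    total = 0.0
    for _ in range(reps):
        y = [y0]
        for _t in range(T):
            y.append(mu + gamma * y[-1] + rng.gauss(0.0, sd))
        total += ls_gamma(y) - gamma
    return total / reps

bias_10 = mc_bias(0.8, 1.0, 10)
bias_20 = mc_bias(0.8, 1.0, 20)
print(f"T = 10: {bias_10:.3f}   T = 20: {bias_20:.3f}")
```

In this simplified design the simulated bias is negative and shrinks in absolute value when the sample size is doubled.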

Using the specifications of the model parameters and the exogenous data 
series, the $\{z_t\}$ series was generated, allowing us to obtain the elements of 
the vector $\bar{z}$ that are present in all the formulae. Therefore $\bar{z}$ is generated 
by starting with model (2.2), 

$z_t = \delta + \gamma z_{t-1} + x_t\beta + \sigma w_t$, 

then 

$z_t = \gamma^tz_0 + \sum_{i=0}^{t-1}\gamma^i\delta + \sum_{i=0}^{t-1}\gamma^ix_{t-i}\beta + \sigma\sum_{i=0}^{t-1}\gamma^iw_{t-i}$, (5.1) 

where $z_0 = 0$, and $\delta = y_0(\gamma - 1) + \mu$. Therefore 

$E(z_t) = \bar{z}_t = \sum_{i=0}^{t-1}\gamma^i\delta + \sum_{i=0}^{t-1}\gamma^ix_{t-i}\beta$. (5.2) 
Table 1. Trending Data 

$x_t = \exp\{.1t\}$, T = 10, μ = 0. 

                 Exact              σ-Approximate       Asymptotic 
 σ²     γ       BIAS     MSE       BIAS      MSE        MSE 

 .5    .400    -.0590   .0859     -.7729     .8313      .2722 
1.0    .400    -.1280   .1088    -1.5460    2.7810      .5444 
 .5    .800    -.0666   .0297     -.1270     .0561      .0313 
1.0    .800    -.1344   .0616     -.2540     .1619      .0626 
 .5   1.000    -.0240   .0063     -.0234     .0068      .0060 
1.0   1.000    -.0490   .0139     -.0468     .0152      .0121 
 .5   1.010    -.0226   .0058     -.0210     .0061      .0055 
1.0   1.010    -.0454   .0126     -.0421     .0135      .0110 
 .5   1.025    -.0206   .0050     -.0179     .0053      .0048 
1.0   1.025    -.0403   .0107     -.0357     .0114      .0096 



The variance-covariance matrix for vector z can be deduced from (5.1): 



$E\{[z_t - E(z_t)][z_s - E(z_s)]\} = \sigma^2\sum_{i=0}^{t-1}\gamma^{2i+s-t}$, $t \le s$, (5.3) 

which are the elements of $\sigma^2HH' = \Omega$ when $t = 1, 2, \ldots, T$. Thus with the 
specification of the parameter space and the construction of the distribu- 
tion of z completed, running the computer algorithm® for various parameter 
values produced Tables 1-5. 
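
The elements of $\Omega = \sigma^2HH'$ can be computed either from the matrix product or from a closed form. The sketch below assumes the closed form $\mathrm{Cov}(z_t, z_s) = \sigma^2\sum_{i=0}^{\min(t,s)-1}\gamma^{2i+|t-s|}$, which follows from $z_t - E(z_t) = \sigma\sum_{i=0}^{t-1}\gamma^iw_{t-i}$ with independent unit-variance $w_t$; the function names and the parameter values are illustrative.

```python
def cov_closed_form(gamma, sigma2, t, s):
    """Cov(z_t, z_s) for 1-based t, s, from the moving-average form of z_t."""
    lo, gap = min(t, s), abs(t - s)
    return sigma2 * sum(gamma ** (2 * i + gap) for i in range(lo))

def omega(gamma, sigma2, T):
    """sigma^2 * H H', with H = (I - gamma*L)^{-1} lower triangular."""
    H = [[gamma ** (i - j) if i >= j else 0.0 for j in range(T)]
         for i in range(T)]
    return [[sigma2 * sum(H[i][k] * H[j][k] for k in range(T))
             for j in range(T)] for i in range(T)]

# Illustrative check with an explosive coefficient and T = 10.
g, s2, T = 1.025, 0.5, 10
Om = omega(g, s2, T)
worst = max(abs(Om[t - 1][s - 1] - cov_closed_form(g, s2, t, s))
            for t in range(1, T + 1) for s in range(1, T + 1))
print(f"largest discrepancy: {worst:.2e}")
```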

The numerical results presented in Tables 1-5 reveal some interesting 
information about the behaviour of the LS bias and MSE in finite samples. 
After thorough examination of this tabular information, the following groups 
of observations can be made. 

1. The exact theoretical results in Section 4 show some interesting 
tendencies with respect to $\sigma^2$, $T$, $\gamma$ and $x_t$, based on the numbers presented in 
Table 2. Nontrending Data 

$x_t$ is randomly selected from a uniform distribution in the interval [0, 1]. 
T = 10, μ = 0. 

                 Exact              σ-Approximate       Asymptotic 
 σ²     γ       BIAS     MSE       BIAS      MSE        MSE 

 .5    .400    -.1310   .1057    -2.562      .1868     1.206 
1.0    .400    -.1508   .1126    -5.125    -1.6650     2.412 
 .5    .800    -.2588   .1432     -.7862    1.308       .2648 
1.0    .800    -.2651   .1488    -1.5720    4.702       .5296 
 .5   1.000    -.1222   .0675     -.0750     .0485      .0218 
1.0   1.000    -.1859   .1045     -.1499     .1505      .0436 
 .5   1.010    -.1133   .0631     -.0658     .0414      .0193 
1.0   1.010    -.1786   .1009     -.1315     .1271      .0387 
 .5   1.025    -.1003   .0567     -.0538     .0329      .0162 
1.0   1.025    -.1672   .0951     -.1077     .0991      .0324 



Tables 1-5. First, it is obvious from viewing Tables 2 and 4 that the 
absolute value of the bias and MSE are decreasing functions of the sample 
size, $T$, at all values of $\sigma^2$, $\gamma$ and $x_t$. Also, all tables predict that the 
absolute bias and MSE are decreasing as $\sigma$ decreases for all values of $T$, 
$\gamma$ and $x_t$, except when $x_t$ is AR(2) and $\gamma = .4$ or $\gamma = .8$. These contrary 
cases are not easily explained but could have something to do with 
common factors in the autoregressive portions of the model. Further, Table 
5 suggests that the inclusion of an intercept in the model together with 
the AR(2) series seems to eliminate these counter-intuitive observations. 
Hence, with few exceptions, the expected asymptotic tendencies of the 
LS bias and MSE are observed with respect to both $\sigma$ and $T$. 

Secondly, while holding the sample size ($T$), standard deviation ($\sigma$) and 
data series ($x_t$) constant, and letting $\gamma$ vary, some interesting conclusions 
Table 3. Autoregressive Process of Order 2 

$x_t = \rho_1x_{t-1} + \rho_2x_{t-2} + v_t$, T = 10 
$\rho_1 = .75$, $\rho_2 = -.5$, $v_t \sim N(0, 1)$, μ = 0 

                 Exact              σ-Approximate       Asymptotic 
 σ²     γ       BIAS     MSE       BIAS      MSE        MSE 

 .5    .400     .2582   .0833     -.0380     .0227      .0202 
1.0    .400     .2065   .0758     -.0760     .0503      .0404 
 .5    .800     .0295   .0239     -.0661     .0325      .0142 
1.0    .800    -.0117   .0407     -.1322     .1017      .0284 
 .5   1.000    -.0716   .0390     -.0671     .0343      .0095 
1.0   1.000    -.0945   .0542     -.1343     .1182      .0190 
 .5   1.010    -.0755   .0399     -.0668     .0342      .0093 
1.0   1.010    -.0971   .0544     -.1335     .1183      .0185 
 .5   1.025    -.0812   .0415     -.0661     .0341      .0089 
1.0   1.025    -.1008   .0549     -.1322     .1184      .0178 



can be drawn from these tables. For all values of $\sigma$, $T$ and $x_t$, the absolute 
bias first tends to increase as $\gamma$ increases from .4 to .8 then decreases as 
$\gamma$ reaches unity and beyond. Again, Table 3 shows a result contrary to 
this generalization. When $x_t$ is an AR(2) series without an intercept, 
absolute bias increases as $\gamma$ approaches unity and beyond. The MSE 
results outline an identical pattern for all $x_t$ series, including the AR(2) 
series. 

An interesting observation concerning the behaviour of the LS estimator, 
$\hat\gamma$, is that it confirms other numerical work done in the literature. The 
increasing bias and MSE as $\gamma$ approaches the unit circle (without 
reaching it) confirms the findings of Phillips (1977), Sawa (1978), Tse (1981) 
and Tanaka (1982) for the stationary case. Also, the findings of Evans 
Table 4. Nontrending Data 

$x_t$ randomly selected from a uniform distribution in the interval [0, 1]. 
T = 20, μ = 0 

                 Exact              σ-Approximate       Asymptotic 
 σ²     γ       BIAS     MSE       BIAS      MSE        MSE 

 .5    .400    -.0469   .0468    -1.146      .4633      .4276 
1.0    .400    -.0709   .0503    -2.292      .9979      .8552 
 .5    .800    -.1476   .0508     -.3702     .1440      .0759 
1.0    .800    -.1522   .0550     -.7404     .4244      .1518 
 .5   1.000    -.0295   .0068     -.0196     .0037      .0024 
1.0   1.000    -.0618   .0162     -.0392     .0100      .0048 
 .5   1.010    -.0243   .0056     -.0158     .0029      .0020 
1.0   1.010    -.0553   .0150     -.0316     .0077      .0039 
 .5   1.025    -.0169   .0024     -.0112     .0020      .0014 
1.0   1.025    -.0437   .0076     -.0223     .0053      .0029 



and Savin (1984) suggest that anything that increases the signal-to-noise 
ratio (or non-centrality parameter) will cause the absolute bias and MSE 
to decrease. From (5.1) of this section, the expression for $z_t$ shows that 
$\gamma$ is an increasingly important parameter of the model especially when 
$\gamma$ reaches the unit circle. Thus as $\gamma$ increases beyond unity, it is obvious 
that the non-centrality parameter $(\bar{z}'\bar{z}/(2\sigma^2))$ is increasing rapidly and 
hence causing the damping effect on the bias and MSE. 

Finally, the tables present information about the behaviour of the bias 
and MSE when alternative exogenous data series are assumed to be 
present. At all values of $\sigma$ and $\gamma$, the bias and MSE tend to be smaller 
in the case of trending data than in the case of AR(2) data. Also, from 
Tables 2 and 3, the AR(2) data yield results that are smaller in absolute 
value than the results for nontrending data. Therefore, it is obvious that 
trending data scenarios provide LS estimators that have a smaller bias 
and MSE than in the case of nontrending data. 
Table 5. Autoregressive Process of Order 2 
Intercept, μ = 1.0, T = 10 

                 Exact              σ-Approximate       Asymptotic 
 σ²     γ       BIAS     MSE       BIAS      MSE        MSE 

 .5    .400    -.0796   .0604     -.1410     .0510      .0948 
1.0    .400    -.1119   .0851     -.2821     .0146      .1895 
 .5    .800    -.1110   .0598     -.1263     .0712      .0387 
1.0    .800    -.1752   .1032     -.2526     .2075      .0774 
 .5   1.000    -.0428   .0209     -.0333     .0142      .0111 
1.0   1.000    -.0968   .0546     -.0665     .0346      .0222 
 .5   1.010    -.0396   .0191     -.0303     .0130      .0103 
1.0   1.010    -.0918   .0514     -.0605     .0315      .0206 
 .5   1.025    -.0350   .0164     -.0261     .0114      .0092 
1.0   1.025    -.0843   .0463     -.0522     .0274      .0183 



An additional point can be raised upon the comparison of Tables 3 and
5. The inclusion of an intercept in the model (in the AR(2) case)
seems to cause the absolute bias and MSE to increase when $\gamma < 1$, but
when $\gamma > 1$ the values tend to be smaller. The intuitive explanation
for this result might be that when $\gamma$ reaches unity or beyond, the
increased uncertainty (noise) of an additional nuisance parameter is
completely overwhelmed by the rapidly increasing noncentrality parameter
(signal). On the other hand, the increased uncertainty of an additional
regressor in the model (nuisance parameter) is too large to be offset when
$\gamma < 1$. This result was also found by Sawa (1978) in a simpler
context and under stationary conditions.

The exact results on the LS estimator for $\gamma$ have revealed some
valuable conclusions. Generally, the results show that the distribution of
$\hat\gamma$ in finite samples is very sensitive to the portion of the
parameter space considered and to the exogenous data-generating series.
Finally, the tendencies of bias and MSE are linked to the signal-to-noise
ratio in an inverse way.

2. The small disturbance approximate results are also presented along with
the small disturbance asymptotic MSE in Tables 1 through 5. It can be
noted first, just as with the exact results, that the approximate bias
and MSE shrink as $\sigma$ decreases and T increases for all values of
$\gamma$ and exogenous data. This is also seen with the asymptotic MSE for
all $\gamma$ and $x_t$. Thus, the expected asymptotic tendencies of the LS
bias and MSE are observed using the approximate formulae.

Secondly, the numerical values for the approximate bias and MSE reveal
some strong conclusions about tendencies as $\gamma$ varies. If one
considers the bias, there is a tendency for the absolute approximate bias
to decrease as $\gamma$ increases from $\gamma = .4$ to $\gamma = 1.025$ for
all data except the AR(2) case. In that case, the bias first increases as
$\gamma$ increases from .4 to .8 and then declines as $\gamma$ increases
further. The approximate MSE behaves consistently when $\gamma > 1$ for all
$x_t$; however, it behaves erratically at values of $\gamma$ less than
unity. Specifically, when $\gamma > 1$, the approximate MSE decreases as
$\gamma$ increases for all $x_t$. In situations such as are illustrated in
Tables 1 and 4, the MSE decreases as $\gamma$ increases towards unity.
Tables 2, 3 and 5 show the MSE increasing as $\gamma$ increases from .4 to
.8, and, in Table 3, as $\gamma$ increases from .4 to 1.0. Further, from all
tables, the asymptotic MSE is decreasing as $\gamma$ increases from .4 to
1.025.

As in the exact results, the approximate results are sensitive to the
exogenous data series generating $z_t$. Identical conclusions about the
size of the bias and MSE in trending data scenarios, relative to
nontrending situations, can be made. That is, the absolute bias and MSE are
larger for nontrending $x_t$ data than for AR(2) and trended data. Also,
the validity of the observation about the inclusion of an extra regressor
(intercept) in the model is evident using the approximate formula. That
is, the bias and MSE increase with an extra nuisance parameter when
$\gamma < 1$ but decrease when $\gamma > 1$. This evidence suggests that the
stronger the mean (signal), the smaller the values of the bias and MSE.

The evidence presented by Tables 1-5 concerning the approximate bias,
MSE and asymptotic MSE essentially corroborates the general conclusions
drawn on the exact results. However, caution must be observed in
situations where the signal-to-noise ratio is small.

3. Having shown that most of the proper conclusions about the exact bias
and MSE can be found through the approximate results ($\gamma > 1$), the
absolute accuracy of the approximation as opposed to the order 1
asymptotic approximation should be assessed. In light of this concern, the
results show that for all $x_t$, $\sigma$ and T, the approximate results
perform very well when $\gamma > 1$. Also, the approximation formulae
outperform the asymptotic formulae for all $\gamma > 1$, $\sigma$, T and
$x_t$ series. Earlier, it was suggested that the signal-to-noise ratio
played a large role in determining the size of bias and MSE. Again, we
suggest that a large signal-to-noise ratio is the main reason for increased
accuracy of approximation. For example, any of our tables will reveal that
the approximation to the exact results is much better when $\sigma = .5$
than when $\sigma = 1.0$; also, consider the approximation when
$\gamma > 1$ and $x_t$ is trending, compared to when $\gamma > 1$ and $x_t$
is nontrending. Further, comparison of the results when $\gamma < 1$ leads
to the conclusion that the approximate formulae perform very poorly. In
fact, the asymptotic results of unbiasedness and asymptotic MSE outperform
the approximate results in most cases when $\gamma < 1$ (specifically
$\gamma < .8$).

In situations where the signal of the model dominates the noise, the 
approximate formulae will behave very well in terms of predictions about 
tendencies with respect to nuisance parameters and will outperform the 
asymptotic formulae in terms of numerical accuracy. 
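
The accuracy comparison in item 3 can be checked directly against the
Table 5 entries. The following is only a sketch: the column layout
(sigma, gamma, exact, approximate, asymptotic) is read off the table
header, and the function name `mse_errors` is hypothetical.

```python
# Rows of Table 5 (AR(2) data, intercept, T = 10), as
# (sigma, gamma, exact_bias, exact_mse, approx_bias, approx_mse, asymptotic_mse).
TABLE_5 = [
    (0.5, 0.400, -0.0796, 0.0604, -0.1410, 0.0510, 0.0948),
    (1.0, 0.400, -0.1119, 0.0851, -0.2821, 0.0146, 0.1895),
    (0.5, 0.800, -0.1110, 0.0598, -0.1263, 0.0712, 0.0387),
    (1.0, 0.800, -0.1752, 0.1032, -0.2526, 0.2075, 0.0774),
    (0.5, 1.000, -0.0428, 0.0209, -0.0333, 0.0142, 0.0111),
    (1.0, 1.000, -0.0968, 0.0546, -0.0665, 0.0346, 0.0222),
    (0.5, 1.010, -0.0396, 0.0191, -0.0303, 0.0130, 0.0103),
    (1.0, 1.010, -0.0918, 0.0514, -0.0605, 0.0315, 0.0206),
    (0.5, 1.025, -0.0350, 0.0164, -0.0261, 0.0114, 0.0092),
    (1.0, 1.025, -0.0843, 0.0463, -0.0522, 0.0274, 0.0183),
]


def mse_errors(rows):
    """Absolute error of each MSE approximation relative to the exact MSE."""
    errors = []
    for sigma, gamma, _, exact_mse, _, approx_mse, asy_mse in rows:
        errors.append({
            "sigma": sigma,
            "gamma": gamma,
            "approx_err": abs(approx_mse - exact_mse),
            "asymptotic_err": abs(asy_mse - exact_mse),
        })
    return errors


if __name__ == "__main__":
    for e in mse_errors(TABLE_5):
        print(e["sigma"], e["gamma"],
              round(e["approx_err"], 4), round(e["asymptotic_err"], 4))
```

For every row of this table with gamma at or above unity, the approximate
MSE lies closer to the exact MSE than the asymptotic MSE does, and at each
gamma the approximation error is smaller for sigma = .5 than for
sigma = 1.0.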

The major findings in this section are summarized in the three distinct
groups above; however, a common theme seems to be clear. It was found
that the exact, approximate and asymptotic moments are very sensitive to
the type of exogenous data and parameter specification used to generate the
mean of $z_t$. Ultimately, the stronger the signal ($z_t$), the smaller the
risk in using LS in this context. The changes in the factors $\sigma$, T,
and $\gamma$ that increased the signal-to-noise ratio caused the absolute
bias and MSE to shrink and improved the performance of the small
disturbance approximations.



5. CONCLUSIONS 

It was the objective of this paper to analyze the distribution, in terms 
of its first and second moments, of the LS estimator of the autoregressive 
coefficient in a first-order stochastic difference equation under the assump- 
tion of white noise, normal errors. Under the condition that the moments 
of this estimator exist up to a specific order, theoretical formulae were de- 
rived, using some well established techniques, for the bias and MSE of the 
estimator. Unfortunately, these formulae were not able to provide any quan- 
titative information about the size and sign of the bias and size of the MSE 
in their raw form. Hence they were evaluated numerically under alternative 
scenarios for the nuisance parameters. After analysis of the computational 
results, conclusions were drawn about the behavioural tendencies of the bias 
and MSE of LS across a defined portion of the nuisance parameter space.

The conclusions reached in this paper are that the bias and MSE vary
inversely with the signal-to-noise ratio, and that the
$\sigma$-approximation’s performance improves with increasing
signal-to-noise ratio. Also, in the context of the model studied here, the
$\sigma$-approximation performs well when the true parameter is around or
outside the unit circle. This is not surprising, since the signal blows up
quickly when $\gamma > 1$.

In most cases, it was found that the $\sigma$-approximation outperformed
the asymptotic results; however, in certain portions of the parameter
space, researchers would be better off using the asymptotic formulae for
MSE. This conclusion is related to the validity of the $\sigma$-expansion
itself under specific scenarios for the nuisance parameters. That is, when
the signal-to-noise ratio is low (i.e., $\sigma$ is large and $\gamma$ is
small), the condition under which a $\sigma$-expansion is admissible is
violated. Therefore, the expansion is invalid and the numerical values for
the formula make little sense.⁹

The information contained in this paper could be of some benefit to
empirical researchers. Once they are able to decide on a prior (or null
hypothesis) for the true parameter, they could determine the risk (in terms
of MSE) of using LS estimators by applying the above exact formulae and/or
the $\sigma$-approximate formulae when appropriate ($\gamma > 1$). The MSE
numbers could be used to specify corrected standard errors for the
t-statistic and improve the accuracy of the statistical inference.

ENDNOTES 

2. The moment formulae presented in this paper, both exact and approx- 
imate, are defining approximating moments that actually exist. Evans 
and Savin (1981) and Mariano (1972) proved that the rth moment exists 
under the condition that the degrees of freedom in the model are greater 
than twice the order of the moment. Their estimators are similar to that 
under study here. Additional proof of this condition for existence can 
be obtained from the author. 

3. Proofs for the theorems in this section can be obtained from the author. 
The proofs are derived using the results given by Sawa (1978), Phillips 
(1980), and Hoque (1980). 

4. This notation is used to distinguish the exact bias
$E(\hat\gamma - \gamma)$ from the truncated series approximation,
$E_0(\hat\gamma - \gamma)$.

5. In deriving the small disturbance approximation, the expansion is only
valid under certain conditions. For example, the LS sampling error can
be expanded as follows:

    $$(\hat\gamma - \gamma) = \left[1 + \frac{2\sigma z'BHw + \sigma^2 w'H'BHw}{z'Bz}\right]^{-1} \times \left[z'Aw + \sigma w'H'Aw\right] = a_0 + a_1 + a_2 + \cdots,$$

where the $a_i$'s are of successively higher orders in $\sigma$. This
expansion is valid if and only if

    $$\left|\frac{2\sigma z'BHw + \sigma^2 w'H'BHw}{z'Bz}\right| < 1.$$

As can be seen, this ratio is random and will obviously take on values
greater than one in absolute value in some parts of the parameter space.

6. Numerical approximation of an integral is a difficult task to do cheaply. 
Thus, I must thank the Department of Economics at The University 
of Western Ontario and ultimately the University itself for allowing me 
enough resources to complete my work. 

7. This process was generated with an initial value of $x_0 = 1$. The first
20 values were dropped, which allows the value for $x_{20}$ to be our
initial value. This is a method suggested by Phillips (1980).

8. The algorithm used to compute the exact results was based on directly 
approximating the area under the function. Using Simpson’s Rule to 
calculate this approximation, the values in the tables are accurate up 
to 1.0 × 10““*. The algorithm used was obtained from John L. Knight.
Professor Knight spent a good deal of his own time helping me program 
this algorithm for which I am greatly in his debt. Any unseen errors, 
however, are mine alone. 

9. See endnote 5. 
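
Endnotes 7 and 8 describe two computational steps: generating the AR(2)
series with the first 20 values dropped, and evaluating an integral by
Simpson's Rule. A minimal sketch of both follows. It is not the author's
program: the AR(2) coefficients, the standard normal innovations, the
choice of setting both start-up lags to the initial value, and the function
names are all assumptions made for illustration.

```python
import math
import random


def simpson(f, a, b, n=1000):
    """Composite Simpson's rule for the integral of f over [a, b]."""
    if n % 2:  # Simpson's rule needs an even number of subintervals
        n += 1
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        # Interior points alternate between weight 4 (odd) and 2 (even).
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0


def ar2_after_burn_in(phi1, phi2, n_obs, burn_in=20, x0=1.0, seed=0):
    """Generate an AR(2) path started at x0 and drop the first `burn_in` values.

    Assumptions (not from the paper): both start-up lags equal x0, the
    innovations are standard normal, and phi1, phi2 are placeholders.
    """
    rng = random.Random(seed)
    lag2, lag1 = x0, x0
    path = []
    for _ in range(burn_in + n_obs):
        x = phi1 * lag1 + phi2 * lag2 + rng.gauss(0.0, 1.0)
        path.append(x)
        lag2, lag1 = lag1, x
    return path[burn_in:]


if __name__ == "__main__":
    print(simpson(math.sin, 0.0, math.pi))  # close to 2
    print(ar2_after_burn_in(0.5, 0.3, n_obs=10))
```

Dropping the start-up values reduces the dependence of the retained series
on the arbitrary initial value.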


THE APPROXIMATE MOMENTS OF
THE 3SLS REDUCED FORM ESTIMATOR
AND A MELO COMBINATION OF OLS-3SLS
FOR PREDICTION

1. INTRODUCTION 

This paper develops the small sample approximations to the moments of
the 3SLS reduced form estimator in a general linear simultaneous equations
model under the classical assumptions. These approximations easily
specialize to the case of k-class reduced form estimators, and they may be
used to evaluate conditional forecasts based on the k-class or 3SLS
estimators.

The method of derivation and its interpretation are discussed in Section 
2. It can be seen that the techniques are more generally applicable and may 
be used to derive the approximate moments of any standard reduced form 
estimator. 

Sections 2 and 3 of the paper discuss examples of how these approximate
expressions have been used, for instance, to evaluate estimators or mixtures
of estimators under quadratic loss and other moment-based criteria. In
particular, an approximate Minimum Expected Loss (MELO) mixture of
unrestricted OLS and 3SLS reduced form estimators is described.

Derivation of asymptotic expansions for the moments of econometric
estimators began with Nagar (1959), who dealt with the k-class estimators of
the structural coefficients in a linear simultaneous equations model. Nagar
also showed how his formula for the bias may be used to develop “almost
unbiased” estimators. Sawa (1973a) developed the “almost unbiased”
application of these formulae, and (e.g.) Nagar and Carter (1976), Sawa
(1973b) and Zellner and Vandaele (1975) have utilized these approximations
to develop Minimum Variance, Minimum Mean Squared Error (MMSE) and MELO
estimators. Such expansions to estimator biases have more recently been
used by Rothenberg (1984) and others for correction of ML, GLS and other
statistics in order to discuss questions of second and higher order
efficiency.

Esfandiar Maasoumi: Department of Economics, Indiana University,
Bloomington, Indiana 47405; also University of Southern California.

I. B. MacNeill and G. J. Umphrey (eds.), Time Series and Econometric
Modelling, 359-371. © 1987 by D. Reidel Publishing Company.

Considerably less attention has been given to estimators and forecasts 
in the reduced form context. While asymptotic properties of the reduced 
form estimators can be evaluated simply by the methods of Goldberger et 
al. (1961) and Dhrymes (1973), higher order approximations do not seem 
to have been developed for many reduced form estimators. One exception is 
the work of Nagar and Sahay (1978), who were concerned with the Partially 
Restricted Reduced Form and the 2SLS estimators. The existence of the mo- 
ments of several reduced form estimators is discussed by McCarthy (1972), 
Sargan (1976b), Maasoumi (1977, 1978, 1985, 1986), and Knight (1977). 

2. APPROXIMATE MOMENTS OF THE 3SLS 

Following Rothenberg (1984), we consider a random sample of size T
from a population with a continuous density function which depends on
an unknown parameter $\theta$. Let $\hat\theta_T$ be the standardized
estimator of $\theta$ and $F_T(k) = \Pr[\hat\theta_T < k]$. For estimators
which are Best Asymptotic Normal (BAN) and admit an Edgeworth expansion to
order $T^{-1}$, we can write:

    $$F_T(k) = F(k) + o(T^{-1}), \qquad (1)$$

    $$F(k) = \eta(k) + T^{-1/2}h(k) + T^{-1}\gamma(k). \qquad (2)$$

In (2), $\eta(\cdot)$ is the standard normal distribution function and
$h(\cdot)$ and $\gamma(\cdot)$ are usually polynomials multiplied by
$\eta(k)$. $\eta$ is referred to as a “first order” approximation, and $F$
a “second order” approximation to $F_T$. Consequently, estimators that
satisfy (1)-(2) are first-order efficient and so may be compared on the
basis of the moments of $F$, i.e., the “second-order” moments. In this
section we derive the moments of $F(\cdot)$ for the case of the 3SLS
reduced form estimator. These approximate moments are thus well defined as
the moments of the distributions which approximate the exact distribution,
$F_T(\cdot)$. [The moments of $\eta$ are first-order approximate moments of
$F_T$ in exactly the same sense. The only distinction with the traditional
“asymptotic theory” is therefore one of degree of approximation and not of
interpretation.]

The validity of Edgeworth approximations to $F_T$ is discussed by Sargan
(1976a) and generally does not depend on the existence of the moments.
When the moments of $F_T$ do not exist, however, care must be taken in
interpreting the moments of $F$ (or $\eta$) as approximations to the
moments of $F_T$. In this situation the value of the approximate moments
derives from the value of $F$ (or $\eta$) as a representation of $F_T$. To
the extent that this is done adequately by $F$, the study of its moments is
of value in characterizing both $F$ and $F_T$. We emphasize that, once
again, there is no distinction between the interpretations of the moments
of the asymptotic distribution of estimators and those of higher order
approximate distributions.

The Linear Simultaneous Equations Model (SEM) is defined for $Y$
($T \times n$) endogenous variables and $Z$ ($T \times m$) non-stochastic
exogenous variables as follows:

    $$AX' = BY' + \Gamma Z' = U', \qquad (3)$$

where $X = (Y, Z)$ is a $T \times (n+m)$ matrix of all observations,
$A = (B, \Gamma)$ is the $n \times (n+m)$ matrix of the unknown
coefficients, and $U$ is the $T \times n$ matrix of the random disturbances
such that each row, $u_t$, satisfies the following assumption:

A1: $u_t \sim \text{i.i.d.}(0, \Sigma)$.

Further assumptions of the classical SEM are:

A2: $\lim_{T\to\infty} T^{-1}Z'Z = M$, a constant matrix of rank m;

A3: $B$ is non-singular.

The reduced form of (3) is given by:

    $$Y' = PZ' + V', \qquad (4)$$

where $P = -B^{-1}\Gamma$, $V' = B^{-1}U'$ and, from A1, rows of $V$ have
zero mean and a common covariance matrix $\Omega = B^{-1}\Sigma B'^{-1}$.
When we need to, we assume that the a priori (identifying) restrictions on
$A$ are of the exclusion (zero order) type that may be represented as
follows:

    $$s - S\alpha = \text{Vec } A. \qquad (5)$$

Here Vec denotes stacking by rows, $\alpha$ is the vector of the non-zero
elements of $A$, and $S = \text{diag}(S_1, \ldots, S_n)$ such that
$XS_i = X_i$ represents the selection from $X$ of only those columns that
appear on the right hand side of the ith equation. The selection vector $s$
represents the “normalization” restriction since its ith subvector, $s_i$,
is such that $Xs_i = y_i$, the left hand side endogenous variable of the
ith equation.

We define the following estimators of $P$:

    $$\hat P = (Y'Z)(Z'Z)^{-1} \qquad (6)$$

is the Unrestricted Least Squares (ULS) estimator of $P$ in (4). Under the
classical assumptions it is unbiased but less efficient than some
Restricted Reduced Form (RRF) estimators,

    $$P^+ = -(B^+)^{-1}\Gamma^+, \qquad (7)$$

where $B^+$ and $\Gamma^+$ may be such full information estimators as 3SLS
or FIML. In this paper I assume that the estimator $A^+$ satisfies the
following property:

A4: $\Delta A = A^+ - A = O_p(T^{-1/2})$.
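
As a concrete illustration of (4) and (6), the sketch below simulates a
reduced form with n = 2 endogenous and m = 2 exogenous variables and
computes the ULS estimator. It is a simplified example, not the paper's
design: the disturbances are drawn independently with a common standard
deviation (so the covariance matrix is a scaled identity), and all
function names and numerical values are hypothetical.

```python
import random


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def inv2(a):
    """Inverse of a 2 x 2 matrix."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]


def simulate_reduced_form(P, Z, noise_sd, seed=0):
    """Return Y (T x n) with rows y_t = P z_t + v_t, v_t ~ N(0, noise_sd^2 I)."""
    rng = random.Random(seed)
    mean = matmul(Z, transpose(P))  # T x n, i.e. the transpose of P Z'
    return [[value + rng.gauss(0.0, noise_sd) for value in row] for row in mean]


def uls(Y, Z):
    """ULS estimator P_hat = (Y'Z)(Z'Z)^{-1}; this helper assumes m = 2."""
    Yt, Zt = transpose(Y), transpose(Z)
    return matmul(matmul(Yt, Z), inv2(matmul(Zt, Z)))


if __name__ == "__main__":
    Z = [[1.0, float(t)] for t in range(1, 11)]   # constant and trend, T = 10
    P = [[0.5, -0.2], [1.0, 0.3]]                 # hypothetical reduced form
    Y = simulate_reduced_form(P, Z, noise_sd=0.5, seed=42)
    print(uls(Y, Z))
```

With `noise_sd=0.0` the estimator reproduces `P` up to rounding, which is
a quick way to confirm the orientation of the matrices.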



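Define $\Delta B$ and $\Delta\Gamma$ as $\Delta A$ was defined. Then:

    $$P^+ = -(B + \Delta B)^{-1}(\Gamma + \Delta\Gamma)$$
    $$\quad = -(B^{-1}\Delta B + I)^{-1}B^{-1}(\Gamma + \Delta\Gamma). \qquad (8)$$

Expanding $(B^{-1}\Delta B + I)^{-1}$,

    $$(B^{-1}\Delta B + I)^{-1} = I - B^{-1}\Delta B + B^{-1}\Delta B \cdot B^{-1}\Delta B + O_p(T^{-3/2}) \qquad (9)$$

and using this in (8), we have:

    $$\Delta P = P^+ - P$$
    $$\quad = -B^{-1}\Delta\Gamma + B^{-1}\Delta B \cdot B^{-1}\Delta\Gamma + B^{-1}\Delta B \cdot B^{-1}\Gamma$$
    $$\qquad - B^{-1}\Delta B \cdot B^{-1}\Delta B \cdot B^{-1}\Delta\Gamma$$
    $$\qquad - B^{-1}\Delta B \cdot B^{-1}\Delta B \cdot B^{-1}\Gamma + O_p(T^{-2}). \qquad (10)$$

If one defines $Q = \binom{P}{I}$ and rearranges, one finds that:

    $$\Delta P = -B^{-1}\Delta A \cdot Q + B^{-1}\Delta B \cdot B^{-1}\Delta A \cdot Q + \cdots. \qquad (11)$$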
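
The two leading terms of (11) can be checked numerically. The sketch below
uses n = m = 2 with made-up coefficient matrices and small perturbations
(all values and function names are hypothetical) and compares the exact
change in the reduced form coefficients with the first-order term and with
the sum of the first two terms.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def inv2(a):
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(c, a):
    return [[c * x for x in row] for row in a]


def max_abs(a):
    return max(abs(x) for row in a for x in row)


def expansion_errors(B, Gam, dB, dGam):
    """Largest absolute error of the one-term and two-term versions of (11)."""
    B_inv = inv2(B)
    P = scale(-1.0, matmul(B_inv, Gam))
    P_plus = scale(-1.0, matmul(inv2(add(B, dB)), add(Gam, dGam)))
    dP = add(P_plus, scale(-1.0, P))                 # exact change in P
    dA = [rb + rg for rb, rg in zip(dB, dGam)]       # (dB, dGam), n x (n+m)
    Q = P + [[1.0, 0.0], [0.0, 1.0]]                 # P stacked over I_m
    first = scale(-1.0, matmul(matmul(B_inv, dA), Q))
    second = matmul(matmul(matmul(B_inv, dB), B_inv), matmul(dA, Q))
    err_first = max_abs(add(dP, scale(-1.0, first)))
    err_second = max_abs(add(dP, scale(-1.0, add(first, second))))
    return err_first, err_second


if __name__ == "__main__":
    B = [[1.0, 0.3], [0.2, 1.0]]
    Gam = [[0.5, -0.4], [0.1, 0.8]]
    eps = 0.01  # plays the role of the O_p(T^{-1/2}) estimation error
    dB = scale(eps, [[0.4, -0.2], [0.1, 0.3]])
    dGam = scale(eps, [[-0.3, 0.2], [0.5, 0.1]])
    print(expansion_errors(B, Gam, dB, dGam))
```

Shrinking `eps` by a factor of ten should reduce the first error by roughly
a hundred and the second by roughly a thousand, in line with the orders of
the omitted terms.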

Let $b^+ = E(\text{Vec }\Delta P)$ and $V(p^+)$ denote, respectively, the
bias and the variance-covariance matrix of $p^+ = \text{Vec } P^+$. Then:

    $$V(p^+) = E[(\text{Vec }\Delta P)(\text{Vec }\Delta P)'] - b^+ b^{+\prime}, \qquad (12)$$

where the first term on the r.h.s. of (12) is the MSE of $p^+$.
Approximations to $b^+$ are obtained by taking expectations of (11), term
by term. For the first term, we note that

    $$\text{Vec}(-B^{-1} \cdot \Delta A \cdot Q) = -(B^{-1} \otimes Q')\,\text{Vec }\Delta A, \qquad (13)$$

where $\otimes$ denotes the Kronecker product. The r.h.s. of (13) is the
basic relationship used by (e.g.) Dhrymes (1973) and Goldberger et al.
(1961) to obtain the asymptotic properties of $P^+$ from those of $A^+$. We
will employ higher order expansions of $E(\text{Vec }\Delta A)$ in (13),
and will combine these with the expectation of the second term of (11). To
obtain the latter expectation, it is easier to work first with simpler
linear functions of the required terms. We will thus first obtain:

    $$E[\text{tr}(\Phi B^{-1}\Delta B \cdot B^{-1}\Delta A \cdot Q)] \qquad (14)$$

for a known arbitrary $m \times n$ matrix $\Phi$, and then recover the
desired expectation from (14). We first note that:

    $$(14) = E\{\text{tr}[B'^{-1}\Delta B'(Q\Phi B^{-1})'\Delta A']\}$$
    $$\quad = E\{(\text{Vec }\Delta A)'(B'^{-1} \otimes Q\Phi B^{-1})(\text{Vec }\Delta B')\}. \qquad (15)$$

Let $\Pi$ be a permutation matrix such that, for a matrix $D$,

    $$\text{Vec } D' = \Pi \cdot \text{Vec } D. \qquad (16)$$

We also define the following “slash” product, $\oslash$, for any two
matrices $F_1$ and $F_2$:

    $$(F_1 \oslash F_2) = (F_1 \otimes F_2)\Pi. \qquad (17)$$
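
The row-stacking convention, the identity behind (13), and the permutation
matrix of (16)-(17) can be verified on small integer matrices. This is an
illustrative sketch only; the helper names (`vec_rows`, `commutation`,
`slash`) are hypothetical and do not come from the paper.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def matvec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def vec_rows(a):
    """Stack the rows of a matrix into one vector (Vec by rows)."""
    return [x for row in a for x in row]


def kron(a, b):
    """Kronecker product of a and b."""
    return [[a_ij * b_kl for a_ij in a_row for b_kl in b_row]
            for a_row in a for b_row in b]


def commutation(rows, cols):
    """Permutation matrix Pi with vec_rows(D') = Pi vec_rows(D), D rows x cols."""
    size = rows * cols
    pi = [[0] * size for _ in range(size)]
    for i in range(rows):
        for j in range(cols):
            # D[i][j] is entry i*cols + j of vec_rows(D)
            # and entry j*rows + i of vec_rows(D').
            pi[j * rows + i][i * cols + j] = 1
    return pi


def slash(f1, f2, pi):
    """The 'slash' product of (17): (F1 kron F2) Pi."""
    return matmul(kron(f1, f2), pi)


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    X = [[1, 0, 2], [-1, 3, 1]]
    C = [[2, 1], [0, 1], [1, -1]]
    # With row stacking, Vec(A X C) = (A kron C') Vec X.
    print(vec_rows(matmul(matmul(A, X), C)) ==
          matvec(kron(A, transpose(C)), vec_rows(X)))
```

The same identity with column stacking would read Vec(AXC) = (C' ⊗ A) Vec X,
which is why the stacking convention matters when comparing formulae across
papers.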

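Subsequently, using (15)-(17), we conclude:

    $$(14) = E[(\text{Vec }\Delta A)'(B'^{-1} \oslash Q\Phi B^{-1})(\text{Vec }\Delta B)]$$
    $$\quad = \text{tr}\{(B'^{-1} \oslash Q\Phi B^{-1}) \cdot E[(\text{Vec }\Delta B)(\text{Vec }\Delta A)']\}. \qquad (18)$$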

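The expression inside { } and, in particular, the covariance between
$\Delta B$ and $\Delta A$ is to be evaluated. Let the asymptotic variance
of $\Delta A$ be denoted by $G$. Then (an approximation to) the variance of
$\Delta A$ is given by

    $$E[(\text{Vec }\Delta A)(\text{Vec }\Delta A)'] = \frac{1}{T}G, \qquad (19)$$

where $G$ is $n(n+m) \times n(n+m)$ and, (e.g.) for 3SLS, we have

    $$G = S[S'(\Sigma^{-1} \otimes R)S]^{-1}S', \qquad (20)$$

where $R = QMQ'$; see Sargan (1978) or Maasoumi (1978). [It is not
necessary at this stage to consider approximations to the variance of
$\Delta A$. But this approximation is inevitable at the final stage.]
Define $\psi = (I_n \;\; 0)$, with the zero block of dimension
$n \times m$, and $\psi^* = (I \otimes \psi)$. It follows that:

    $$\text{Vec }\Delta B = \psi^* \cdot \text{Vec }\Delta A$$

and

    $$E[(\text{Vec }\Delta B)(\text{Vec }\Delta A)'] = \frac{1}{T}\psi^* G = \bar G, \text{ say}. \qquad (21)$$

Using (21) in (18), we may assert:

    $$(14) = \text{tr}\{[B'^{-1} \oslash Q\Phi B^{-1}]\bar G\}$$
    $$\quad = \text{tr}\{[B'^{-1} \otimes Q\Phi B^{-1}]\,\Pi\bar G\}, \qquad (22)$$

where

    $$\Pi\bar G = \begin{pmatrix} G_{11} & \cdots & G_{1n} \\ \vdots & & \vdots \\ G_{n1} & \cdots & G_{nn} \end{pmatrix} \qquad (23)$$

with each submatrix $G_{ij}$ having $n \times (n+m)$ dimensions. From (22)
it may be verified that:

    $$(14) = \text{tr}\sum_{j=1}^{n} b^{jj}\,\Phi B^{-1}G_{jj}Q, \qquad (24)$$

where $b^{jj}$ is the jth diagonal element of $B^{-1}$, and noting that
$E[\text{tr}(\cdot)] = \text{tr}[E(\cdot)]$, we have:

    $$E(B^{-1}\Delta B \cdot B^{-1}\Delta A \cdot Q) = \sum_{j=1}^{n} b^{jj}B^{-1}G_{jj}Q. \qquad (25)$$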

Gathering terms from (25) and (13), we have:

    $$b^+ = (B^{-1} \otimes Q')\left[\sum_{j=1}^{n} b^{jj}(\text{Vec } G_{jj}) - E(\text{Vec }\Delta A)\right] + \cdots. \qquad (26)$$

Approximations to Vec $G_{jj}$ and $E(\text{Vec }\Delta A)$ should retain
all terms of order $T^{-1}$ or larger in order to correspond to (11). Lower
order approximations for $b^+$ will necessarily omit the first term of
(26). It remains to replace $E(\text{Vec }\Delta A)$ with the approximate
bias of $A^+$. For a variety of structural estimators, such as k-class,
FIML and 3SLS, these approximations have been given in the literature. For
3SLS, in particular, these are given by Sargan (1976a). To order $T^{-1}$,
the bias is as follows:
E{Vec AA) = mG (S"^ (g) /) g + (/0 Q J q 

+ Qq-2G (S"^ 0 I) - G [(E-^FS"^) 0 /] g 
- G (E-^ ® 4-.) g, (27) 







where is the block transpose of ^ = (S ^ = J2j ^ = 

(y Z), y = y -y , g = (gl, gi, . . . , g^)', g, = 1/T V* = {UB'-^O) 

is T X (n-fm), is the T-element vector of disturbances in the ith equation, 
H = tr[G(/0 X^X)], and Q = diag.((5i,Q2> • • • >Qn)» where is obtained 
from (X^X»)“^ by adding rows and columiw of zeros corresponding to the 
excluded variables in the ith equation, and Qs = Z)r=i Q%- 

The above techniques may be used to obtain higher order approximations
to the second and higher order moments of $P^+$. These expressions can be
used to obtain almost unbiased estimators, and are also useful in comparing
higher order efficiency of the estimators. We will proceed to demonstrate
their use by considering an “optimal” mixture of the ULS estimator
($\hat P$) and the 3SLS ($P^+$) reduced form estimators under quadratic
loss. The latter (or its expectation) is approximated using the moment
expansions given above.

2. APPROXIMATE MINIMUM EXPECTED LOSS (MELO) 
COMBINATIONS OF OLS-3SLS 

This and the next section are based on Maasoumi (1985). 

Let $\hat P = Y'Z(Z'Z)^{-1}$ denote the ULS estimator of $P$ and
$P^+ = -(B^+)^{-1}\Gamma^+$ another estimator derived from such restricted
estimators of $B$ and $\Gamma$ as 2SLS, 3SLS, FIML, etc. We propose the
following mixed estimator of $P$:

    $$P^* = \lambda\hat P + (1 - \lambda)P^+ \qquad (28)$$
    $$\quad = P^+ + \lambda(\hat P - P^+). \qquad (29)$$

In the remainder of this paper lower case letters $p$, $\hat p$, $p^+$
denote vec $P$, vec $\hat P$ and vec $P^+$, respectively. Let the bias in
$p^*$ be denoted by $b^*$. It is readily seen that, since
$E(\hat p) - p = 0$,

    $$|b^*| = |E(p^*) - p| = |(1 - \lambda)b^+| < |b^+|, \text{ if } 0 < \lambda < 1. \qquad (30)$$

Using the following identity, the variance matrix of $p^*$, $V(p^*)$, is
obtained in terms of the variances of $\hat p$, $p^+$ and their covariance:

    $$p^* - E(p^*) = \lambda(\hat p - p) + (1 - \lambda)(p^+ - E(p^+)); \qquad (31)$$

    $$V(p^*) = \lambda^2 V(\hat p) + (1 - \lambda)^2 V(p^+) + 2\lambda(1 - \lambda)\,\text{cov}(\hat p, p^+). \qquad (32)$$

Under the standard assumptions of our model, and since the mixing
parameter $\lambda$ is a constant, the bias and variance of $p^*$ would be
finite if $p^+$ has finite moments. For instance, when $p^+$ represents the
FIML estimator, $b^*$ and $V(p^*)$ are finite so long as $T - n - m > 2$;
see Sargan (1976b). If $p^+$ represents either 2SLS or 3SLS, then $b^*$ and
$V(p^*)$ will not be finite unless $\lambda = 1$ ($p^* = \hat p$). For
exactly identified models $P^* = \hat P$ since $P^+ = \hat P$ in that case.

Since $\hat p$ is a consistent estimator under our assumptions, it is seen
that $p^*$ will be consistent if $p^+$ is consistent. If not, the
inconsistency in $p^*$ will be smaller than that in $p^+$ as long as
$\lambda \in [0, 1]$. If both $\hat p$ and $p^+$ are inconsistent but have
the same limit in probability, then $p^*$ will also be inconsistent with
the same plim as $\hat p$ (or $p^+$).

As for asymptotic efficiency, the derivations given in the next section
may be used to verify that:

    $$AV(p^*) = \lambda^2 AV(\hat p) + (1 - \lambda^2)AV(p^+)$$
    $$\quad = AV(p^+) + \lambda^2[AV(\hat p) - AV(p^+)], \qquad (33)$$

where $AV(\cdot)$ denotes the asymptotic variance. From (33) it is clear
that $p^*$ is more efficient than the ULS so long as $\hat p$ is less
efficient than the restricted estimator $p^+$. While this is the case for
the full information estimators such as the 3SLS and FIML, it is not always
so for the limited information estimators such as 2SLS and LIML; see
Dhrymes (1973). The latter statement holds even if $p^+$ is replaced by the
Partially Restricted Reduced Form (PRRF) estimator of Amemiya (1966) and
Kakwani and Court (1972). For while PRRF has finite moments (see Knight,
1977), it is not necessarily more asymptotically efficient than 2SLS. On
the other hand, $p^*$ is less efficient than 3SLS and FIML, but can be more
efficient than (e.g.) 2SLS whenever $\hat p$ is.

3. MIXED PREDICTION UNDER QUADRATIC LOSS 

Let $Y_f^* = P^*Z_f$ be the predictor of $Y_f$ conditional on $Z_f$ under
the assumption that $Y_f = PZ_f + V_f$, where $V_f$ denotes the forecast
period random disturbance with the same properties as $V_t$, t = 1,...,T.
The forecast error and a general quadratic loss are given as follows:

    $$Y_f^* - Y_f = (P^* - P)Z_f - V_f, \qquad (34)$$

and

    $$L(Y_f^*, Y_f) = (Y_f^* - Y_f)'W^*(Y_f^* - Y_f)$$
    $$\quad = \text{tr}[W^*(Y_f^* - Y_f)(Y_f^* - Y_f)'], \qquad (35)$$

where $W^*$ is a symmetric, positive definite matrix of known weights. From
(34)-(35) the expected loss (risk) is derived as follows:

    $$R(Y_f^*) = E[L(\cdot)]$$
    $$\quad = \text{tr}[W\,\text{MSE}(p^*)] + \text{tr}[W^*\Omega], \qquad (36)$$

where $W = (W^* \otimes Z_fZ_f')$ and MSE($p^*$) is the MSE matrix of
$p^*$. Since the second term of (36) is common to all conditional
forecasts, we focus on the first term, which is a well known estimation
risk function. Consequently, minimization of $R(Y_f^*)$ is equivalent to
minimization of $R(p^*) = \text{tr}[W\,\text{MSE}(p^*)]$ with respect to
$\lambda$. We note that, from (30) to (32):

    $$R(p^*) = \lambda^2\,\text{tr}[WV_0] + (1 - \lambda)^2\,\text{tr}[WV_+]$$
    $$\qquad + 2\lambda(1 - \lambda)\,\text{tr}[W\,\text{cov}(\hat p, p^+)] + (1 - \lambda)^2 b^{+\prime}Wb^+, \qquad (37)$$

where $V_0 = V(\hat p)$ and $V_+ = V(p^+)$. To minimize $R(p^*)$ with
respect to $\lambda$ consider

    $$\frac{\partial R}{\partial\lambda} = 2\lambda\,\text{tr}[WV_0] - 2(1 - \lambda)\,\text{tr}[WV_+]$$
    $$\qquad + 2(1 - 2\lambda)\,\text{tr}[W\,\text{cov}(\cdot)] - 2(1 - \lambda)b^{+\prime}Wb^+ \qquad (38)$$

and

    $$\frac{\partial^2 R}{\partial\lambda^2} = 2\,\text{tr}[WV_0] + 2\,\text{tr}[WV_+] - 4\,\text{tr}[W\,\text{cov}(\cdot)] + 2b^{+\prime}Wb^+$$
    $$\quad = 2\,\text{tr}[WV(\hat p - p^+)] + 2b^{+\prime}Wb^+ > 0, \qquad (39)$$

where $V(\hat p - p^+)$ denotes the variance of $(\hat p - p^+)$. From
(38), the optimal value $\lambda_1^*$ of $\lambda$ is obtained by solving
$\partial R(\cdot)/\partial\lambda = 0$,

    $$\lambda_1^* = \frac{b^{+\prime}Wb^+ + \text{tr}[W(V_+ - \text{cov}(\cdot))]}{b^{+\prime}Wb^+ + \text{tr}[WV(\hat p - p^+)]}. \qquad (40)$$

It may be observed that the denominator of $\lambda_1^*$ is non-negative,
and $\lambda_1^* \le 1$ if $[V_0 - \text{cov}(\hat p, p^+)]$ is positive
semidefinite. In what follows we demonstrate that this condition, as well
as the range of possible values for $\lambda_1^*$, depends on the level of
approximation considered for the otherwise unknown moments entering in
(40). Equivalently, these issues depend on the order of finite sample
approximations for the $L(\cdot)$ and $R(\cdot)$ functions.
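
The minimization leading to (40) can be sketched with scalar inputs. The
code below takes the four trace and bias quantities of the risk function
(37) as given numbers; the numerical values and the function names are
hypothetical and serve only to check that the closed form minimizes the
quadratic risk.

```python
def risk(lam, tr_wv0, tr_wvplus, tr_wcov, bias_wb):
    """Estimation risk R(p*) of (37) as a function of the mixing weight lam.

    tr_wv0    = tr[W V0]            (ULS variance term)
    tr_wvplus = tr[W V+]            (restricted estimator variance term)
    tr_wcov   = tr[W cov(p_hat, p+)]
    bias_wb   = b+' W b+            (squared bias term)
    """
    return (lam ** 2 * tr_wv0
            + (1.0 - lam) ** 2 * tr_wvplus
            + 2.0 * lam * (1.0 - lam) * tr_wcov
            + (1.0 - lam) ** 2 * bias_wb)


def optimal_lambda(tr_wv0, tr_wvplus, tr_wcov, bias_wb):
    """Closed-form minimizer corresponding to (40)."""
    tr_w_vdiff = tr_wv0 + tr_wvplus - 2.0 * tr_wcov   # tr[W V(p_hat - p+)]
    return (bias_wb + tr_wvplus - tr_wcov) / (bias_wb + tr_w_vdiff)


if __name__ == "__main__":
    args = (2.0, 1.0, 0.8, 0.5)        # made-up values for illustration
    lam_star = optimal_lambda(*args)
    print(lam_star, risk(lam_star, *args))
```

Because the second derivative in (39) is positive, the risk is convex in
the mixing weight and the stationary point is the global minimum.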

Strictly speaking, since 2SLS and 3SLS reduced form estimators have no
finite moments, $\lambda_1^* = 1 \Rightarrow p^* = \hat p$ is the only
member of the corresponding mixtures that has finite quadratic risk. For
FIML, on the other hand, all the corresponding mixtures have finite risk if
$T - n - m > 2$. In either case, when we consider Nagar-type approximations
to these moments we are in effect evaluating the risk functions with
respect to finite sample approximations to the exact sampling distributions
of the $p^+$ estimator.

We note the following well known results:

    $$\text{plim}_{T\to\infty}\, T(\hat p - p)(\hat p - p)' = (\Omega \otimes M^{-1}) = \lim TV_0 = O(1). \qquad (41)$$

In other words, under our standard assumptions, $V_0 = O(T^{-1})$. When
$p^+$ is the 3SLS estimator, it follows that:

    $$\text{plim}_{T\to\infty}\, T(p^+ - p)(p^+ - p)' = (B^{-1} \otimes Q')G(B'^{-1} \otimes Q)$$
    $$\quad = O(1)$$
    $$\quad = \lim TV_+, \text{ say}. \qquad (42)$$

Consequently Maasoumi (1978) has shown that, if $p^+$ denotes the 3SLS
estimator,

    $$\text{plim}_{T\to\infty}\, T(p^+ - p)(\hat p - p^+)' = 0 \qquad (43)$$

and

    $$\text{plim}_{T\to\infty}\, T(p^+ - p)(\hat p - p)' = \lim TV_+ = O(1). \qquad (44)$$

The asymptotic properties given in (43)-(44) hold for both the 3SLS and
FIML estimators and may also be deduced from a Rao-Blackwell lemma;
see, for example, Hausman (1978). They do not hold for the asymptotically
less efficient 2SLS or LIML reduced form estimators.

In $\lambda_1^*$, if we replace all terms with their $O(T^{-1})$
approximations and utilize the results in (41)-(44), we find

    $$\lambda_1^* = b^{+\prime}Wb^+ / \{\text{tr}[W(V_0 - V_+)] + b^{+\prime}Wb^+\}, \qquad (45)$$

where $b^+$ is the approximate bias of $p^+$ obtained by retaining terms of
$O_p(T^{-1/2})$ in the expansion of $p^+ - p$. We note that, if
$b^+ = O(T^{-1/2})$, $\lambda_1^* = O(1)$ since
$V_0 = (\Omega \otimes (Z'Z)^{-1}) = O(T^{-1})$ and

    $$V_+ = (B^{-1} \otimes Q')S[S'(\Sigma^{-1} \otimes R)S]^{-1}S'(B'^{-1} \otimes Q)$$
    $$\quad = O(T^{-1}),$$

where $R = (X'Z)(Z'Z)^{-1}(Z'X)$, and under these conditions we have:

    $$0 \le \lambda_1^* \le 1 \qquad (46)$$

whenever $V_0 - V_+$ is non-negative definite. This last condition is
clearly satisfied for the full information estimators which permitted the
simple formula in (45).

The mixing parameter $\lambda_1^*$ has several desirable properties:

(i) As the efficiency gain of the restricted estimator over the ULS
decreases, $\lambda_1^* \to 1$ and the corresponding mixed estimator
($p^*$) moves closer to the simple ULS estimator.

(ii) As the bias of the efficient estimator increases, $\lambda_1^* \to 1$
and $p^* \to \hat p$. This is evidently desirable since this bias would be
large either due to structural misspecification or due to poor finite
sample properties of the efficient estimator (even as judged by its
approximate distribution), or both. On the other hand, $p^* \to p^+$ as
$\lambda_1^* \to 0$, which occurs as $b^+ \to 0$.

(iii) The formula for $\lambda_1^*$ is seen to provide a mechanism for
pooling of estimators (predictors) which accounts for the efficiency-bias
trade-offs.

(iv) Under correct specification $p^+$ is a consistent estimator.
Therefore, when the sample size is “sufficiently” large it is reasonable to
expect $b^+$ to be close to zero. This will also pool the mixed estimator
toward the asymptotically desirable estimator ($p^+$). This pattern of
large sample behavior for $b^+$ has been confirmed by numerous Monte Carlo
studies; for example, see Maasoumi (1977), and Rhodes and Westbrook (1980).
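
Properties (i), (ii) and (iv) follow from the ratio form of (45) and can be
seen with a few scalar evaluations. In this sketch the two arguments stand
for the squared bias term and the trace of the weighted efficiency gain;
the function name and the numerical inputs are hypothetical.

```python
def lambda_first_order(bias_wb, tr_w_gain):
    """Mixing weight of (45): b+'Wb+ / (tr[W(V0 - V+)] + b+'Wb+).

    bias_wb   = b+' W b+           (non-negative)
    tr_w_gain = tr[W (V0 - V+)]    (non-negative when V0 - V+ is n.n.d.)
    """
    if bias_wb < 0 or tr_w_gain < 0:
        raise ValueError("both arguments must be non-negative")
    if bias_wb == 0 and tr_w_gain == 0:
        raise ValueError("weight is undefined when both terms are zero")
    return bias_wb / (tr_w_gain + bias_wb)


if __name__ == "__main__":
    # (i) shrinking efficiency gain pushes the weight on ULS towards one
    for gain in (1.0, 0.1, 0.01):
        print("gain", gain, lambda_first_order(0.5, gain))
    # (ii) and (iv): larger bias raises the weight, vanishing bias removes it
    for bias in (0.0, 0.5, 5.0):
        print("bias", bias, lambda_first_order(bias, 1.0))
```

The weight always lies between zero and one under the stated sign
conditions, which matches the range given for the mixing parameter above.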

The formula given for $\lambda_1^*$ in (40) may of course be approximated
at a higher level. It can be verified that the next possible degree of
approximation will retain terms of $O(T^{-2})$. The resulting value for
$\lambda$ will behave more like $\lambda_1^*$ while exhibiting only some of
the properties enumerated for $\lambda_1^*$. While these higher order
expressions may be computed from Section 2 of this paper and the moments
given by Sargan (1976a), improved approximation is by no means guaranteed
by the additional terms. Some have argued that if $O(T^{-2})$ terms are of
significance then the sample size is too small to allow reliable inferences
in reasonably sized simultaneous systems. Nevertheless, there is a higher
level of approximation for $\lambda_1^*$ that results in an interesting
variant of $\lambda_1^*$. This is obtained from $\lambda_1^*$ by replacing
$b^+$ with an $O(T^{-1})$ approximation of $b^+$ given in Section 2, and
maintaining the $O(T^{-1})$ approximations for variances and covariances.
This approximation produces a mixing parameter, $\lambda^*$, which is
$O(T^{-1})$, and therefore a mixed estimator (predictor) which is
asymptotically equivalent to the asymptotically desirable method based on
$p^+$. A further justification for this choice of $\lambda$ is that, if
$p^+$ is consistent, the odd order terms ($O_p(T^{-1/2})$, $O_p(T^{-3/2})$
etc.) in the expansion of $p^+ - p$ have zero expectations under the
normality assumption and may be dropped in obtaining an $O(T^{-1})$
approximation for $b^+$.



BOOTSTRAPPING AND FORECAST UNCERTAINTY:

A MONTE CARLO ANALYSIS 

ABSTRACT 

For certain kinds of problems such as capacity planning, the estimate
of forecast uncertainty may be as important as the prediction itself.
Earlier research (Veall, 1985) has applied Efron’s bootstrapping technique
to a linear regression forecast of peak demand for Ontario Hydro. This
paper presents a limited Monte Carlo analysis to assess the potential
accuracy of bootstrapping for this example.

1. INTRODUCTION 

Since the invention of the bootstrap by Efron (1979), there have been a 
number of applications of the technique to problems in applied econometrics 
(for example, Finke and Theil, 1984; Freedman and Peters, 1984c; Kora- 
jczyk, 1985; Vinod and Raj, 1984). One particular example of great interest 
is the work with energy forecasting of Freedman and Peters (1984a,b), which 
is closely related to the simulation methods of Fair (1979, 1980) for evalu- 
ating macroeconomic predictive accuracy. Standard methods of estimating 
forecast uncertainty are inadequate because they rely on normality assump- 
tions and/or on the assumption that values of the independent variables for 
the forecast period are known with certainty. Largely because of their non- 
parametric features, the bootstrap and other methods of computationally 
intensive statistics represent potential solutions to these difficulties. 

This paper will describe a simple Monte Carlo evaluation of the suc- 
cess of bootstrapping in assessing forecast uncertainty. Because of the large 
computer costs involved in using a computer-intensive technique to analyze 
another computer-intensive technique, the Monte Carlo experiment is neces- 
sarily very limited. However, the case chosen is of special interest as it relates 
to the forecasting of peak electricity demand in Ontario, where the uncer- 
tainty estimate is critical with respect to capacity planning (Ellis, 1980). If 
demand is very uncertain, the risk associated with generation types which 
have high lead times, high fixed costs but low variable costs may be too great 
and capacities with lower lead times and fixed costs may be desirable even 
though these have higher running costs. If demand estimates have low un- 
certainty, all other things equal, capital-intensive, high-lead-time generating 
capacity will be more worthwhile. 

The example chosen is based on Veall (1985). To the author’s knowledge, 
the research here is the first sampling experiment on bootstrap estimation 
of the probability distribution of the forecast error. 

2. BOOTSTRAPPING FORECAST UNCERTAINTY: AN EXAMPLE 

Explanations of bootstrapping are given by Efron (1979) and Efron and 
Gong (1983), with some description of the forecasting problem given by 
Freedman and Peters (1984a,b). A detailed discussion of the example here 
is presented by Veall (1985) so that only the briefest possible explanation is 
provided in the following. 

The model here is very simple, so that attention can be focussed on 
the estimation of the forecast uncertainty rather than the point estimates. 
Estimated by Ordinary Least Squares (OLS), the result for the first equation 
is: 

$\log(PEAK_t) = 0.8953 + 0.9482 \log(AMW_t) + \hat{\epsilon}_t$ (1) 

with standard errors 0.1257 and 0.0141 respectively, an $R^2$ of .9960 and a 
Durbin-Watson statistic of 1.7709. For the second equation: 

$\log(AMW_t) = 1.2331 - 0.3887 \log(P_t) + 0.4336 \log(Y_t) + 0.0293\, TIME_t + \hat{\eta}_t$ (2) 

with standard errors 1.7291, 0.0495, 0.1083 and 0.0046 respectively, an $R^2$ 
of .9985 and a Durbin-Watson statistic of 1.7526. All data are for the period 
1963-1982 for the East System of Ontario Hydro (which comprises about 90 
percent of total provincial demand). $PEAK_t$ is peak demand in megawatts 
(MW), $AMW_t$ is average demand over the year, also in MW, $P_t$ is the real 
average price of electricity, $Y_t$ is total Ontario real income and $TIME_t$ is a 
linear time trend. The data are described by Veall (1985) and are available 
upon request. 

A number of diagnostic tests were performed on the model given by 
(1)-(2). The details are given by Veall (1985); to summarize the main 
results, at the 5 percent level: 

(1) The null hypothesis of no serial correlation could not be rejected for 
either disturbance (Durbin-Watson test, Godfrey (1978) test). 

(2) The null hypothesis of homoscedasticity could not be rejected for 
either disturbance (White (1980) test, Breusch and Pagan (1979) 
test, Engle (1982) ARCH test). 

(3) The coefficients of additional variables added to (2), specifically 
weather variables and the log of the real price of natural gas, were 
not significantly different from zero. 

(4) The null hypothesis that $\epsilon$ and $\eta$ were uncorrelated could not be 
rejected (Hausman (1978) test). 

(5) The null hypothesis of normality of $\epsilon$ and $\eta$ could each not be re- 
jected (Shapiro and Wilk (1965) test, Shapiro and Francia (1972) 
test, Kiefer and Salmon (1983) test, all as applied to the OLS resid- 
uals as suggested by White and MacDonald (1980)). 

While the power of all these tests in such a small sample is questionable, 
the fact that there is no evidence of heteroscedasticity, serial correlation 
and simultaneous equations bias is encouraging for the following bootstrap 
application. As the bootstrap is nonparametric, the normality tests are not 
as important. 

For forecasting the peak demand for the year 1990, it is assumed that 
the real price of electricity will be at its 1982 level and real income will 
grow at 2 percent per year. To follow what is perhaps the most complex 
example given by Veall (1985), it is in addition assumed that there is a 
subjective uncertainty attached to each of the log real price and log real 
income forecasts corresponding to a random normal variable with mean 0 
and standard deviation 0.1. One of the justifications for the bootstrap is that 
as $PEAK_{1990}$ is a function of $AMW_{1990}$, which is in turn a function of $P_{1990}$ 
and $Y_{1990}$, then even if $\epsilon$, $\eta$, $Y_{1990}$ and $P_{1990}$ are all normal, $PEAK_{1990}$ will 
not be. This suggests that the nonparametric aspect of the bootstrap may 
be very important. 

The actual bootstrapping is performed as follows: 

(i) An ordinary regression forecast for 1990, $\widehat{PEAK}_{1990}$, is calculated 
using the estimates from (1) and (2) and the point predictions of 
$P_{1990}$, $Y_{1990}$ and $TIME_{1990}$, and then taking the exponent. 

(ii) Artificial residual sets $\eta^*$ and $\epsilon^*$ for 1963-1982 are created by draw- 
ing randomly and with replacement 20 times from $\hat{\eta}$ and $\hat{\epsilon}$. This is 
repeated $i = 1, \ldots, B$ times. Random numbers throughout the pro- 
cedure are generated by routines due to Wichmann and Hill (1982) 
and (for normal random numbers), Beasley and Springer (1977). 

(iii) $B$ artificial samples are created by replacing $\hat{\eta}_t$ with $\eta_t^*$ in (2) and 
taking the resulting $\log(AMW_t^*)$ and using it along with $\epsilon_t^*$ in (1). 

(iv) A forecast $\widehat{PEAK}{}^*_{1990}$ is calculated by estimating the model (1*) 
and (2*) on each bootstrap sample $i = 1, \ldots, B$ and following the 
process described in (i) above, using the estimates from the bootstrap 
sample. 

(v) A simulated actual $PEAK^*_{1990}$ is calculated by first construct- 
ing $\log(P^*_{1990})$ and $\log(Y^*_{1990})$, which are $B$ independent artificially- 
generated $N(0, 0.01)$ variables added to $\log(P_{1990})$ and $\log(Y_{1990})$ re- 
spectively. Then $\log(AMW^*_{1990})$ is calculated by putting $\log(P^*_{1990})$ 
and $\log(Y^*_{1990})$ into (2) and replacing $\eta$ with $\eta^*_{1990}$, another resid- 
ual drawn randomly and with replacement from $\hat{\eta}$. Continuing, 
$PEAK^*_{1990}$ is calculated by putting $\log(AMW^*_{1990})$ in (1) and re- 
placing $\epsilon$ with $\epsilon^*_{1990}$ (another residual drawn randomly and with 
replacement from $\hat{\epsilon}$) and taking the exponent. 

(vi) The simulated forecast error $SFE^*$ is calculated as 

$SFE^* = PEAK^*_{1990} - \widehat{PEAK}{}^*_{1990}$. 

(vii) The distribution of $PEAK_{1990}$, conditional on the model and current 
information, is modeled as: 

$\widehat{PEAK}_{1990} + SFE^*$ (3) 

In words, forecasts from the bootstrap samples are compared to “simulated 
actuals” computed using random draws from the residuals $\hat{\epsilon}$ and $\hat{\eta}$ as well as 
assumed subjective probability distributions for $P_{1990}$ and $Y_{1990}$. The differ- 
ences are estimated forecast errors and are used to estimate the uncertainty 
in the single regression forecast $\widehat{PEAK}_{1990}$. 
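
Steps (i)-(vii) can be sketched in code. The following is a minimal illustration only, not the program used in the paper: the data are synthetic (generated from the coefficients reported in (1) and (2)), the helper names `ols`, `fit` and `bootstrap_distribution` are hypothetical, and Python's built-in generator stands in for the Wichmann-Hill and Beasley-Springer routines.

```python
import math
import random


def ols(X, y):
    """Solve the normal equations X'X b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * v for r, v in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


def xb(b, z):
    return sum(bi * zi for bi, zi in zip(b, z))


def fit(Z, log_peak, log_amw):
    """OLS for (2), then (1); returns coefficients and residual lists."""
    b = ols(Z, log_amw)
    a = ols([[1.0, v] for v in log_amw], log_peak)
    eta = [v - xb(b, z) for v, z in zip(log_amw, Z)]
    eps = [p - a[0] - a[1] * v for p, v in zip(log_peak, log_amw)]
    return a, b, eta, eps


def log_peak_1990(a, b, z, eta=0.0, eps=0.0):
    return a[0] + a[1] * (xb(b, z) + eta) + eps


def bootstrap_distribution(Z, log_peak, log_amw, z1990, B=100, sd=0.1, rng=random):
    a, b, eta, eps = fit(Z, log_peak, log_amw)
    point = math.exp(log_peak_1990(a, b, z1990))                    # step (i)
    n, sfe = len(Z), []
    for _ in range(B):
        eta_s = [rng.choice(eta) for _ in range(n)]                 # step (ii)
        eps_s = [rng.choice(eps) for _ in range(n)]
        amw_s = [xb(b, z) + e for z, e in zip(Z, eta_s)]            # step (iii)
        peak_s = [a[0] + a[1] * v + e for v, e in zip(amw_s, eps_s)]
        a_s, b_s, _, _ = fit(Z, peak_s, amw_s)                      # step (iv)
        forecast_s = math.exp(log_peak_1990(a_s, b_s, z1990))
        z_s = [z1990[0], z1990[1] + rng.gauss(0, sd),               # step (v)
               z1990[2] + rng.gauss(0, sd), z1990[3]]
        actual_s = math.exp(
            log_peak_1990(a, b, z_s, rng.choice(eta), rng.choice(eps)))
        sfe.append(actual_s - forecast_s)                           # step (vi)
    return point, sorted(point + e for e in sfe)                    # step (vii)


# Synthetic 1963-1982 data (hypothetical; NOT the Ontario Hydro series).
# Each row of Z is [1, log P_t, log Y_t, TIME_t].
rng = random.Random(0)
Z = [[1.0, rng.gauss(0.0, 0.2), 4.0 + 0.03 * t + rng.gauss(0.0, 0.05), float(t)]
     for t in range(1, 21)]
log_amw = [1.2331 - 0.3887 * z[1] + 0.4336 * z[2] + 0.0293 * z[3]
           + rng.gauss(0.0, 0.0143) for z in Z]
log_peak = [0.8953 + 0.9482 * v + rng.gauss(0.0, 0.0206) for v in log_amw]
# 1990 regressors: price at its last level, income growing 2 percent a year.
z1990 = [1.0, Z[-1][1], Z[-1][2] + 8 * math.log(1.02), 28.0]

point, dist = bootstrap_distribution(Z, log_peak, log_amw, z1990, B=100, rng=rng)
upper_95 = dist[int(0.95 * len(dist))]  # rough .95 probability point
```

The sorted list `dist` plays the role of the estimated distribution in (3); its upper order statistics are the quantities a capacity planner would read off.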

3. MONTE CARLO ANALYSIS 

As mentioned, the Monte Carlo analysis of the bootstrap will be neces- 
sarily simple and limited, because of computational cost. The first step was 
to calculate a “true” distribution for 1990 peak demand, conditional on the 
model and current information. This is done by maintaining the assumption 
that the subjective errors in $\log(P_{1990})$ and $\log(Y_{1990})$ are $N(0, 0.01)$ and in addition assuming $\epsilon$ and 
$\eta$ are also normal with standard deviations equal to .0206420 and .0142578 
respectively. These standard deviations are the standard errors of regres- 
sions (1) and (2). It can be seen that the 1990 peak will then be distributed 
lognormally with location and scale parameters that are easy to calculate. 
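
As a rough illustration of how the location and scale parameters follow from the stated assumptions, the sketch below combines the coefficients of (1) and (2) with the four variances. It assumes the four random components are mutually independent and treats the estimated coefficients as fixed; the median of 22096 is the value reported later in this section, and the function name `peak_quantile` is hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist

A1 = 0.9482                    # coefficient on log(AMW) in (1)
B_P, B_Y = -0.3887, 0.4336     # price and income coefficients in (2)
SD_EPS = 0.0206420             # standard error of regression (1)
SD_ETA = 0.0142578             # standard error of regression (2)
VAR_SUBJ = 0.01                # variance of each subjective error
MEDIAN = 22096.0               # median of the 1990 peak reported in the text

# Variance of log(AMW_1990): subjective price and income errors plus eta.
var_log_amw = (B_P ** 2 + B_Y ** 2) * VAR_SUBJ + SD_ETA ** 2
# Scale of log(PEAK_1990): propagate through (1) and add epsilon.
scale = sqrt(A1 ** 2 * var_log_amw + SD_EPS ** 2)
location = log(MEDIAN)


def peak_quantile(p):
    """p-th probability point of the lognormal 1990 peak."""
    return exp(location + scale * NormalDist().inv_cdf(p))


upper_points = {p: peak_quantile(p) for p in (0.90, 0.95, 0.99)}
```

Under these assumptions the scale works out to roughly 0.06 in logs, so the upper probability points lie within about 15 percent of the median.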
The next step is to determine how well the bootstrap estimate of the 
distribution could estimate the true distribution just calculated. To do this, 
200 Monte Carlo data sets are calculated for 1963-1982 using the estimates 
and the variables $P_t$, $Y_t$ and $TIME_t$ (with no randomness) from (1) and 
(2) with $\hat{\epsilon}_t$ and $\hat{\eta}_t$ replaced with random normal disturbances with standard 
deviations equal again to the standard errors of their respective regressions. 
Each Monte Carlo data set is then used as a basis for a bootstrap procedure 
and the resulting set of bootstrap estimates of the probability distribution 
is compared to the “true” distribution above. A variety of methods are 
attempted, such as using 100, 500 or 1000 bootstraps, changing the number 
of Monte Carlo data sets from 200 to 500 and employing Efron’s smoothed 
bootstrap (1982, p. 30) with various smoothing constants. The results are 
presented in Table 1. The analysis will be largely from the perspective of the 
upper tail, as presumably this is most important to the capacity planner. 

With 100 bootstraps, the average bootstrap estimate of the .90, .95 and 
.99 probability points is quite accurate, in all cases within about .5 percent of 
the truth. The variation across Monte Carlo runs can be analyzed somewhat 
crudely by looking at the standard deviations of these bootstrap estimates: 
these are typically of moderate size, about 600-800 MW or about 3 percent 
of the true value. For this example the lower tail does not fit as well but 
both the median of the bootstrap and the bootstrap forecast (the average of 
the bootstrap forecasts) are quite accurate estimates of the actual median 
22096 (which is also the actual forecast with the original data). Finally, the 
bootstrap estimates of the bias (equal to the forecast from each Monte Carlo 
1963-1982 data set subtracted from the average of the bootstrap forecasts) 
are all tiny. 

Increasing the number of bootstraps to 500 or 1000 doesn’t change these 
results very much, improving the fit in the lower tail and, surprisingly for 
1000, worsening the fit slightly in the upper tail. Returning to 100 bootstraps 
but increasing the number of Monte Carlo runs to 500 has no important 
effect. 

In different kinds of bootstrap problems than the one studied here, Efron 
(1982, p. 32) reported some success with smoothing, which is accomplished 
by adding independent draws from pseudo-random variates $N(0, k^2\hat{\sigma}_\eta^2)$ and 
$N(0, k^2\hat{\sigma}_\epsilon^2)$ to the bootstrap residuals $\eta_t^*$ and $\epsilon_t^*$ respectively, where $k$ is an 
arbitrary smoothing constant. These disturbances are then scaled back to 
their estimated variances by dividing each result by $(1 + k^2)^{1/2}$, and the 
rest of the bootstrap proceeds as before. Compared to the unsmoothed case 
with 100 bootstraps, a smoothing constant $k$ of .1 helps very slightly in the 
tails but there seems to be little gain from further smoothing. As $k \to \infty$, 
the smoothed bootstrap becomes the so-called parametric bootstrap which 
replaces the residual distribution entirely with a parametric (in this case 
normal) distribution. Again somewhat surprisingly the Monte Carlo results 
indicate that there is little gain from this procedure for this example. 
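
The smoothing step can be sketched as follows. This is an illustrative reading of the description above, not the paper's code: the function name `smoothed_draws` is hypothetical, and the residual variance is estimated here by the mean square of the residuals, which is an assumption.

```python
import math
import random


def smoothed_draws(residuals, k, size, rng=random):
    """Resample residuals, add N(0, k^2 sigma^2) noise, rescale by sqrt(1 + k^2)."""
    n = len(residuals)
    # OLS residuals with an intercept have mean zero, so the mean square
    # serves as the variance estimate (assumed estimator).
    sigma = math.sqrt(sum(r * r for r in residuals) / n)
    scale = math.sqrt(1.0 + k * k)
    return [(rng.choice(residuals) + rng.gauss(0.0, k * sigma)) / scale
            for _ in range(size)]


# k = 0 gives the ordinary bootstrap; a large k approaches normal draws.
example_residuals = [-0.03, -0.01, 0.0, 0.01, 0.03]
plain = smoothed_draws(example_residuals, 0.0, 10, random.Random(1))
smooth = smoothed_draws(example_residuals, 0.1, 10, random.Random(1))
```

Dividing by $(1 + k^2)^{1/2}$ keeps the variance of the smoothed draws equal to the estimated residual variance for any choice of $k$.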

To isolate the residual/bootstrap part of the exercise, the calculations 
in Table 2 are based on another case (Veall, 1985) with the assumption 
that 1990 price and income are known and hence nonrandom. Again for 
this example the bootstrap is quite accurate in the upper tail, although 
perhaps not as accurate as before. The bootstrap continues to overestimate 
the dispersion in the lower tail and the pattern is not changed much by 
smoothing. 

Finally, Table 3 is based on the assumption that the disturbance $\epsilon$ in the peak equation is 
uniform, so that the nonparametric nature of the bootstrap may be explored. 
Maintaining the assumption of nonrandom 1990 price and income, the actual 
1990 peak probability distribution can be calculated by exponentiating the 
appropriate convolution of the normal and uniform. The table shows that 
the upper tail is again fit fairly well by the bootstrap. Smoothing again is 
not very helpful but is not especially deleterious, despite the fact that it is 
employing a normal kernel and the actual distribution is nonnormal. 

4. CONCLUSIONS 

For this series of Monte Carlo experiments on this example, the accuracy 
of the bootstrap in estimating the upper tail of the probability distribution 
of future peak demand is remarkable. This is particularly so given that most 
of the evidence is based on methods which employ 100 bootstrap samples. 
While the upper tail is most important for the capacity planning example at 
hand, the lower tail is not fit as well, suggesting caution in the application 
of these findings to other situations. In this example there is no evidence 
that increasing the number of bootstrap samples or using the smoothed or 
parametric bootstrap will offer much improvement. 

ACKNOWLEDGMENTS 

Thanks are due to D. Fretz, G. Green and W. Cheng for research assis- 
tance and to B. Efron, R. Ellis, I. McLeod, R. Tibshirani, F. Trimnell and 
A. Ullah for valuable advice. The assistance of the Centre for the Study 
of International Economic Relations at The University of Western Ontario, 
Ontario Hydro and the Social Science and Humanities Research Council of 
Canada is acknowledged. 




Table 2. Monte Carlo Results on Bootstrap Estimate of Distribution of 1990 Peak Demand 
(Nonrandom 1990 Price and Income) 

[Table body not recoverable from the source scan.] 

Standard errors are in parentheses. All results based on 200 Monte Carlo replications and 100 bootstraps. 




Table 3. Monte Carlo Results on Bootstrap Estimate of Distribution of 1990 Peak Demand 
(Nonrandom 1990 Price and Income; $\epsilon$ Uniform) 



MICHAEL R. VEALL 



[Table body not recoverable from the source scan.] 





Standard errors are in parentheses. All results based on 200 Monte Carlo replications and 100 bootstraps. 
Hiroki Tsurumi 



USE OF THE MEAN SQUARED ERRORS OF FORECASTS 
IN TESTING FOR STRUCTURAL SHIFT: A COMPARISON 
WITH THE CHOW TEST FOR AN UNDERSIZED CASE 

1. INTRODUCTION 

The mean squared error of forecast (MSEF) is a statistic used to eval- 
uate post-sample prediction performance. The MSEF has been used as a 
descriptive measure, but its exact distribution can be derived either from a 
sampling-theoretical or from a Bayesian perspective if the MSEF is computed 
from a linear regression model. In this paper, sampling and Bayesian distri- 
butions of the MSEF are derived, and it is suggested that the MSEF may 
be used as a statistic to test for structural shift. The powers of the MSEF 
are compared with those of the Chow test for a case where the sample size 
of the second regime is less than the number of the regression coefficients 
(i.e., the undersized case). 



2. SAMPLING AND BAYESIAN DISTRIBUTIONS OF THE MSEF 

Let the linear model be given by 

$y = X\beta + u$, (1) 

where $y$ is an ($n \times 1$) vector of observations on the dependent variable, $X$ is 
an ($n \times k$) matrix of observations on the explanatory variables with rank $k$, 
$u$ is an ($n \times 1$) vector of error terms, and $\beta$ is a ($k \times 1$) vector of unknown 
regression coefficients. Assume that $u \sim N(0, \sigma^2 I_n)$ and that $\beta$ is estimated 
by $\hat{\beta} = (X'X)^{-1}X'y$. 

The mean squared error for the post-sample period, $n+1, \ldots, n+m$, is 
computed using the post-sample actual observations on $y$ and $X$. Let $y_*$ and 
$X_*$ be, respectively, an ($m \times 1$) vector and an ($m \times k$) matrix of post-sample 
observations and assume that the rank of $X_*$ is $\min(m, k)$. Then the MSEF is 

$MSEF = \frac{1}{m}(y_* - \hat{y}_*)'(y_* - \hat{y}_*)$, (2) 

where $\hat{y}_* = X_*\hat{\beta}$. Given equation (1) and $\hat{\beta} = \beta + (X'X)^{-1}X'u$, equation 
(2) can be written as 

$MSEF = \frac{1}{m}\,\epsilon_*'B'B\,\epsilon_* = \frac{1}{m}\sum_{i=1}^{m}\mu_i\xi_i^2$, (3) 

where $\epsilon_* = (u', u_*')'$, $B = (A, -I_m)$, $A = X_*(X'X)^{-1}X'$, and the $\mu_i$'s are 
the nonzero characteristic roots of $B'B$. The $\xi_i$ are elements of $\xi = c'\epsilon_*$, 
where $c$ is the matrix of characteristic vectors of $B'B$. In passing, let us 
note that the $\mu_i$'s are given by $\mu_i = 1 + \lambda_i$, $i = 1, \ldots, m$ for $m < k$, and 
$\mu_i = 1 + \lambda_i$, $i = 1, \ldots, k$; $\mu_i = 1$, $i = k + 1, \ldots, m$, for $m > k$, where $\lambda_i$ is 
the $i$th nonzero characteristic root of $AA'$. 

Since $\epsilon_* \sim NID(0, \sigma^2 I_{n+m})$, $m \cdot MSEF$ is a quadratic form in normal vari- 
ables. The distribution of quadratic forms or ratios of quadratic forms has 
been investigated by many; some of the earlier works are by McCarthy 
(1939), von Neumann (1941), and Bhattacharyya (1943). Bhattacharyya 
(1945) and Hotelling (1948) employed Laguerre expansions, and Gurland 
(1953) and Johnson and Kotz (1970) refined further the convergent Laguerre 
expansions. Tsurumi (1985) used the degenerate hyperbolic function, which 
is convenient for computational purposes. Since we will use the distribution 
of a quadratic form in normal variables, let us give Tsurumi's result as a 
lemma. 



Lemma 1. Let $x = m \cdot MSEF/\sigma^2$. Then the distribution of $x$ is given by 






( 4 ) 



where c(m,p) is the recursive coefficient given by 



c(m,p) 



r(p + c(m - l,j)a^ ^ 

r(P+f) ^0 (P-J)’ ’ 



for m > 2, 



( 5 ) 



and c(l,0) = 1, c(l,i) = 0 for i > 1; o„ = | Pi > M 2 > 

* * • > for m > 2, ai = (1/2^ J, and aj = 1 for all t = 1, . . . , m. 
Proof. See Tsurumi (1985). Using this lemma, we will establish the sam- 
pling distribution of the MSEF. 



Theorem 1. Let $z$ be the MSEF. Then $u = z/(\mu_m s^2)$ has the probability 
density function given by 



where 



and 



/(« I A*!, • • • , Mm, v) = const • 



^m/2 — 1 



(1 ^ 23i^)(m+i/)/2 



const = 



-■'■nL.,-;'’!'© 



$\nu s^2 = y'[I - X(X'X)^{-1}X']y$, $\nu = n - k$. 



( 6 ) 



Proof. Equation (4) in Lemma 1 is the probability density function of 
$x = m \cdot MSEF/\sigma^2$. The probability density function (pdf) of $z = MSEF$ 
may be obtained by transforming $z = (\sigma^2/m)x$, and this becomes 









,P,P,-2P 



where 



( 7 ) 



Cl = 



m 



mf2 



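On the other hand the pdf of $s^2$ is given by 

$f(s^2 \mid \nu, \sigma^2) = [2^{\nu/2}\Gamma(\nu/2)]^{-1} (\nu/\sigma^2)^{\nu/2} (s^2)^{\nu/2 - 1} \exp\{-\nu s^2/(2\sigma^2)\}$ 

and it is easy to show that $z$ and $s^2$ are independent. Hence, the joint pdf 
of $z$ and $s^2$ is 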

f{z,s^ I m,a^) = 

• exp I -^[1 + m«/(i//i.nS*)]| ^c(m,p)m'’«P<r“*P, 

^ ^ p=0 
where $c_3$ is a constant formed from $c_1$ and $[2^{\nu/2}\Gamma(\nu/2)]$. Changing the variables from $z$ and $s^2$ to $u = z/(\mu_m s^2)$ 
and an auxiliary variable, and integrating the latter out, we obtain the desired result. 

The Bayesian predictive density of the MSEF may be derived in two 
ways. We may first reduce the MSEF to a quadratic form in normal vari- 
ables by subtracting out $\beta$ and obtain equation (3). Realizing that this 
quantity contains a nuisance parameter, $\sigma^2$, we integrate it out by utilizing 
the posterior pdf for $\sigma^2$. This approach was used by Tsurumi (1985). Alter- 
natively, we may start with the joint density for $y_*$, $\beta$, and $\sigma^2$, integrate 
$\beta$ out first, and transform $y_*$ into the MSEF given $\sigma^2$. Finally we integrate $\sigma^2$ 
out. This second approach is used in the theorem below. 



Theorem 2. The Bayesian predictive density of the MSEF is given by 



p{z I s^,v,m) a 



2Pc[m,p) ( 



mz 



p=0 



i/s^ H- mzlii, 






( 8 ) 



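Proof. The predictive pdf of $y_*$ is given by 

$p(y_* \mid y, X, X_*) = \int\!\!\int p(y_* \mid \beta, \sigma^2, X_*)\, p(\beta, \sigma \mid y, X)\, d\sigma\, d\beta$, (9) 

where 

$p(y_* \mid \beta, \sigma^2, X_*) \propto \sigma^{-m} \exp\{-\frac{1}{2\sigma^2}(y_* - X_*\beta)'(y_* - X_*\beta)\}$ 

and we shall use the posterior pdf of $\beta$ and $\sigma$ that is given by 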

p{p, (T\y,X)cK exp I - ^ [i/s* + P)'X'X{P - /3)] | . 



Thus p(y* I P,a^jX^)p{Py(T | y,X) becomes 

p(y. 1 I y,-^) « 

• exp [yiil - X,X+)y, + {^ - ^)'X',X,{p - y§.)] | 

■ h' + “ h'X'X{^ - 4)] } , 



( 10 ) 
where $X_*^+$ is the Moore-Penrose generalized inverse of $X_*$, and $\tilde{\beta}_* = X_*^+ y_* + 
(I - X_*^+ X_*)r$, where $r$ is an arbitrary $k \times 1$ vector. Arranging the sum of 
two quadratic forms in $\beta$ into a single quadratic form in $\beta$, and integrating $\beta$ out, 
we derive from equation (9) 

p(y. I y>x,x,) <x j exp 

where $H = [I + X_*(X'X)^{-1}X_*']^{-1}$. The right-hand side of equation (11) 
shows that, given $\sigma$, $y_* - X_*\hat{\beta}$ is distributed as $N(0, \sigma^2 H^{-1})$. Let $w = 
y_* - X_*\hat{\beta}$ and $H = R'R$, where $R$ is a nonsingular matrix. Then $m \cdot z = w'w = 
\eta'(RR')^{-1}\eta$, with $\eta = Rw \sim N(0, \sigma^2 I_m)$. Since the nonzero characteristic 
roots of $H^{-1} = (R'R)^{-1}$ are the same as those of $(RR')^{-1}$, we see that 
$m \cdot z = \sum_{i=1}^{m} \mu_i \xi_i^2$ with $\xi_i \sim N(0, \sigma^2)$, and $\mu_i$ is the $i$th characteristic root 
of $H^{-1}$. Hence, given $\sigma$, $m \cdot z$ has the distribution of a quadratic form in 
normal variables that is given in equation (4). Thus 

p{z 1 s*,m) oc j 1 <r^m)exp (12) 

and integrating $\sigma$ out, we obtain the desired result. 



3. USE OF THE MSEF FOR TESTING FOR A PARAMETER SHIFT 



The MSEF may be used for testing for a structural shift or for compar- 
ing non-nested linear models. Tsurumi (1985) presented numerical examples 
of the use of the MSEF for testing for a structural shift and for comparing 
non-nested linear models using actual data. In this section, let us focus our 
attention on the use of the MSEF for testing for a structural shift when 
the number of post-sample observations, m, is smaller than the number of 
regression coefficients, k. We assume that the join point is known and the 
switch from the first to second regime is abrupt. Using sampling experi- 
ments, we compare the performances of the MSEF with those of the Chow 
test. The performances are judged by empirical powers. 

The hypotheses of a parameter shift may be given as follows: 

$H_0: \beta = \beta_*$ versus $H_1: \beta \neq \beta_*$, (13) 

where $\beta_*$ is the vector of regression coefficients in regime 2. The Chow test 
for an undersized case ($m < k$) is given by the following F criterion: 

$F(m, \nu) = \frac{(e'e - \nu s^2)/m}{s^2}$, (14) 

where $e'e = (\bar{y} - \bar{X}\bar{\beta})'(\bar{y} - \bar{X}\bar{\beta})$, $\bar{y} = (y', y_*')'$, $\bar{X} = (X', X_*')'$, and $\bar{\beta} = 
(\bar{X}'\bar{X})^{-1}\bar{X}'\bar{y}$. Thus, $e'e$ is the sum of squared residuals from the pooled 
sample of size $n + m$. For $m = 1$, the sampling MSEF test criterion, $u = z/(\mu_m s^2)$, 
is identical to the Chow test, and this is given in the following 
lemma. 

Lemma 2. For $m = 1$, the sampling MSEF test criterion, $u = z/(\mu_m s^2)$, is 
identical to the Chow test. 

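Proof. For any $m \geq 1$, the sum of squared residuals from the pooled sample 
becomes 

$e'e = \bar{y}'[I - \bar{X}(\bar{X}'\bar{X})^{-1}\bar{X}']\bar{y}$ 
$\quad = \bar{y}'\bar{y} - (y'XB^{-1}X'y + 2y_*'X_*B^{-1}X'y + y_*'X_*B^{-1}X_*'y_*)$, 

where $\bar{y}'\bar{y} = y'y + y_*'y_*$ and $B^{-1} = (X'X + X_*'X_*)^{-1}$. Using the identity 

$(X'X + X_*'X_*)^{-1} = (X'X)^{-1} - (X'X)^{-1}X_*'HX_*(X'X)^{-1}$, 

where $H = [I + X_*(X'X)^{-1}X_*']^{-1}$, we see that 

$y'XB^{-1}X'y = \hat{\beta}'X'X\hat{\beta} - \hat{\beta}'X_*'HX_*\hat{\beta}$, 
$y_*'X_*B^{-1}X'y = y_*'HX_*\hat{\beta}$, 

and 

$y_*'X_*B^{-1}X_*'y_* = y_*'(I - H)y_*$. 

Hence, $e'e$ becomes 

$e'e = \nu s^2 + (y_* - X_*\hat{\beta})'H(y_* - X_*\hat{\beta})$, 

the Chow test statistic becomes 

$F(m, \nu) = (y_* - X_*\hat{\beta})'H(y_* - X_*\hat{\beta})/(m s^2)$ 

and for $m = 1$, $H^{-1} = \mu_1 = 1 + X_*(X'X)^{-1}X_*'$, we have $F(1, \nu) = z/(\mu_1 s^2)$. 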
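
Lemma 2 can be checked numerically. The sketch below is only an illustration under simplifying assumptions: it uses a simple regression on a constant and one regressor ($k = 2$, $m = 1$) with simulated data, for which $x_*'(X'X)^{-1}x_*$ has the familiar closed form $1/n + (x_* - \bar{x})^2/S_{xx}$; the helper name `ols2` is hypothetical.

```python
import random


def ols2(x, y):
    """OLS of y on a constant and x; returns (intercept, slope, RSS)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return intercept, slope, rss


rng = random.Random(0)
n, k, m = 20, 2, 1
x = [rng.uniform(0.0, 3.0) for _ in range(n)]
y = [10.0 + 0.4 * xi + rng.gauss(0.0, 1.0) for xi in x]
x_new = rng.uniform(0.0, 3.0)               # single post-sample observation
y_new = 10.0 + 0.4 * x_new + rng.gauss(0.0, 1.0)

intercept, slope, rss_first = ols2(x, y)
nu = n - k
s2 = rss_first / nu                          # s^2 from the first regime

# Sampling MSEF criterion u = z / (mu_1 s^2) with m = 1.
z = (y_new - intercept - slope * x_new) ** 2 / m
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
mu_1 = 1.0 + 1.0 / n + (x_new - x_bar) ** 2 / sxx
u = z / (mu_1 * s2)

# Chow statistic from the pooled and first-regime residual sums of squares.
_, _, rss_pooled = ols2(x + [x_new], y + [y_new])
chow = (rss_pooled - rss_first) / (m * s2)
```

The two statistics `u` and `chow` agree up to rounding error, as the lemma states.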

The MSEF test criterion, $u = z/(\mu_m s^2)$, can be interpreted either from 
the sampling or Bayesian perspective. In the sampling theory view, both 
$z$ and $s^2$ are regarded as random variables, whereas in the Bayesian view 
$z$ is the random variable and $s^2$ is a given realized value of a random variable. The 
predictive density of $z$ is obtained given $s^2$, and inference about $z$ is made 
conditionally on $s^2$. The predictive density of $z$, then, will vary if $s^2$ varies. 
If we wish to use the Bayesian MSEF criterion to test for a parameter shift 
such as is given in $H_0$ versus $H_1$ in (13), the power of the test will have 
to be obtained given a particular value of $s^2$. If the realized value of $s^2$ 
is much smaller than the value of $\sigma^2$ under which the data, $y$ and $y_*$, are 
generated, the power associated with this small $s^2$ may tend to be higher 
than the powers of the sampling MSEF criterion or of the Chow test. In 
sampling experiments, we shall derive the power of tests for various values 
of $s^2$ and identify that value of $s^2$ which yields roughly the same power as 
the sampling MSEF. If this value of $s^2$ comes from, say, the 70th percentile 
point of the sampling distribution of $s^2$, then one may argue that at least 
70% of the time, the Bayesian MSEF criterion produces a better power than 
the sampling MSEF criterion. 

Sampling experiments are made under the following model and design: 

$y_i = \beta_1 + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5} + \epsilon_i$, $i = 1, \ldots, n$. (15) 

$x_{i2}$ and $x_{i3}$ are drawn from uniform distributions, $U(0, 3)$, and are indepen- 
dent of each other. Since economic data often suffer from multicollinearity, 
we design $x_{i4}$ and $x_{i5}$ as follows: 

$x_{i4} = x_{i2} + v_{i1}$, $v_{i1} \sim U(0, .5)$, 
$x_{i5} = x_{i2} + v_{i2}$, $v_{i2} \sim U(0, 1)$, 

and $v_{i1}$ and $v_{i2}$ are independent of each other. The values of $\beta_i$ ($i = 1, \ldots, 5$) 
under the null hypothesis, $H_0$, are 

$H_0$: ($\beta_1 = 10$, $\beta_2 = .4$, $\beta_3 = .6$, $\beta_4 = 1$, $\beta_5 = 2$) 

and under the alternative hypothesis, $H_1$, $\beta_2$ and $\beta_3$ are scalar multiples of 
$\beta_2 = .4$ and $\beta_3 = .6$, while $\beta_1$, $\beta_4$, and $\beta_5$ stay the same as in $H_0$. The error 
term of the regression equation, $\epsilon_i$, is drawn from $\epsilon_i \sim NID(0, 1)$, and we 
set the sample size, $n$, to be 20. The values of $x_{ij}$ are drawn once and for 
all, and 500 replications are made for each set of ($\beta_1, \ldots, \beta_5$). The determi- 
nant of the matrix of simple correlation coefficients among $x_{i2}, \ldots, x_{i5}$ is 
.0006195, showing a high degree of multicollinearity. The average coefficient 
of determination, $R^2$, for each set of 500 replications is .72. 

As discussed earlier, the Bayesian interpretation of the MSEF is that 
the predictive density of $z$ is derived given a realized sample of size $n$, $y = 
(Y_1, \ldots, Y_n)'$. To derive empirical powers of the Bayesian MSEF criterion, 
$u = z/(\mu_m s^2)$, we generated $y_*$ for the second regime in the following way. Since 
the predictive density of $y_*$ given $X$ and $X_*$ is the multivariate Student t, 

$p(y_* \mid y, X, X_*) \propto [\nu s^2 + (y_* - X_*\hat{\beta})'H(y_* - X_*\hat{\beta})]^{-(\nu + m)/2}$, 

we generated the values of $y_*$ from the multivariate Student t pdf by using 
the estimate $\hat{\beta}$ that is obtained by $\hat{\beta} = \beta + (X'X)^{-1}X'\epsilon_y$, where $\epsilon_y$ are those 
values of the error terms that give rise to a particular value of $s^2$, through 
$y = (Y_1, \ldots, Y_n)'$: $y = X\beta + \epsilon_y$. For the Chow test and sampling MSEF 
criterion, on the other hand, $y_*$ is drawn from $y_* = X_*\beta_* + \epsilon_*$, $\epsilon_* \sim N(0, 1)$. 
The matrix $X_*$ was generated in the same way as the matrix $X$. 

Empirical powers are presented in Table 1 for m = 1, ..., 5. The MSEF
statistic is given by u = z/(k_m s^2) for both the sampling and Bayesian
interpretations. When m = 1, the Chow and sampling MSEF criteria are
identical, and the Bayesian MSEF evaluated at the 75th percentile
value of s^2 produces power that is comparable to that of the Chow and
sampling MSEF statistics. Any value of s^2 that is less than the 75th
percentile value produces higher power than the Chow test does. For m = 2,
the Bayesian MSEF produces power equal to or better than that of the Chow or
the sampling MSEF criterion about 65% of the time. As m increases to 3, 4,
and 5, the percentage of the time in which the Bayesian MSEF does as well as
or better than the Chow or the sampling MSEF statistics declines slowly,
to 55% for m = 5. Comparing the Chow test statistic and the
sampling MSEF criterion, we notice that for m = 2 the latter has slightly
better power than the former, but for m = 3, 4, 5 the Chow test performs
better than the sampling MSEF.
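An empirical power is the fraction of replications in which a test rejects the null hypothesis. The sketch below illustrates the procedure for the textbook Chow predictive test, whose statistic is d'[I + X*(X'X)^{-1}X*']^{-1}d / (m s^2) with d = y* - X*\hat{\beta}, and which follows F(m, n - k) when there is no shift. This is an assumption-laden illustration: the paper's own Chow statistic is defined in an equation not reproduced in this section and may differ, and the design matrix, the coefficient vector, k = 3, and the redrawing of X in every replication are all hypothetical choices. The critical value 3.59 is the tabulated 5% point of F(2, 17).

```python
import random

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                for c in range(col, n + 1):
                    aug[r][c] -= factor * aug[col][c]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def chow_predictive_f(x, y, x_star, y_star):
    """Textbook Chow predictive statistic; F(m, n - k) under no shift."""
    n, k, m = len(x), len(x[0]), len(x_star)
    cols = list(zip(*x))
    xtx = [[dot(ci, cj) for cj in cols] for ci in cols]
    beta_hat = solve(xtx, [dot(c, y) for c in cols])
    resid = [yi - dot(row, beta_hat) for row, yi in zip(x, y)]
    s2 = dot(resid, resid) / (n - k)
    d = [yi - dot(row, beta_hat) for row, yi in zip(x_star, y_star)]
    # w[j] = (X'X)^{-1} x*_j, so V = I + X* (X'X)^{-1} X*'.
    w = [solve(xtx, row) for row in x_star]
    v = [[(1.0 if i == j else 0.0) + dot(x_star[i], w[j]) for j in range(m)]
         for i in range(m)]
    return dot(d, solve(v, d)) / (m * s2)

def empirical_power(c, n=20, m=2, reps=500, critical=3.59, seed=1):
    """Rejection frequency when the second coefficient is scaled by c."""
    rng = random.Random(seed)
    beta = [1.0, 1.0, 1.0]                        # hypothetical coefficients
    beta_shift = [beta[0], c * beta[1], beta[2]]  # hypothetical form of the shift

    def design(rows):
        return [[1.0, rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(rows)]

    rejections = 0
    for _ in range(reps):
        x, x_star = design(n), design(m)
        y = [dot(r, beta) + rng.gauss(0, 1) for r in x]
        y_star = [dot(r, beta_shift) + rng.gauss(0, 1) for r in x_star]
        if chow_predictive_f(x, y, x_star, y_star) > critical:
            rejections += 1
    return rejections / reps

size_estimate = empirical_power(1.0)   # c = 1: no shift, should be near .05
power_estimate = empirical_power(4.0)  # c = 4: large shift
print(size_estimate, power_estimate)
```

Running the function at c = 1 checks that the rejection frequency stays near the nominal level, and running it over a grid of c values traces out a power curve of the kind reported in Table 1. The numbers it produces are not expected to match the table, since the design here is invented.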

4. CONCLUSIONS 

In this paper, sampling and Bayesian distributions of the MSEF were 
derived, and using sampling experiments it was demonstrated that the MSEF 
criterion may be used as a statistic to test for a parameter shift. When the 
number of post-sample observations is less than the number of regression 
coefficients (m < k), the MSEF criterion, especially the Bayesian MSEF,
tends to perform better than or as well as the Chow test. For the case of
m > k, there is a Chow test statistic that is more powerful than the one
given in equation (14) and thus the Chow test outperforms the sampling as 
well as the Bayesian MSEF criteria. 







Table 1. Empirical Powers of the Chow, Sampling MSEF,
and Bayesian MSEF, Under H_0: c = 1, \alpha = .05.

                                 Values of c for c\beta_2 under H_1

                                      1.5     2.0     2.5     3.0     3.5     4.0

m = 1    Chow = Sampling MSEF         .13     .31     .65     .86     .97    1.00
         Bayesian MSEF (75%)*         .21     .35     .66     .88     .98    1.00

m = 2    Chow                         .12     .26     .51     .70     .86     .95
         Sampling MSEF                .17     .33     .58     .77     .90     .98
         Bayesian MSEF (65%)          .12     .27     .48     .71     .91     .96

m = 3    Chow                         .06     .14     .27     .52     .78     .89
         Sampling MSEF                .07     .10     .22     .48     .61     .82
         Bayesian MSEF (58%)          .05     .17     .34     .53     .77     .90

m = 4    Chow                         .09     .18     .40     .65     .87     .97
         Sampling MSEF                .07     .09     .17     .32     .60     .78
         Bayesian MSEF (55%)          .10     .23     .38     .71     .91     .99

m = 5    Chow                         .12     .23     .48     .80     .93     .99
         Sampling MSEF                .09     .18     .41     .75     .90     .98
         Bayesian MSEF (55%)          .08     .28     .60     .86     .97     .99



(1) Sample size n = 20; 500 replications.

(2) The sampling and Bayesian MSEF statistic is u = z/(k_m s^2).

(3) *Bayesian MSEF (75%) means that those values of
\epsilon_y = (\epsilon_1, \ldots, \epsilon_n)' that gave rise to s^2 at the 75th
percentile of the distribution of s^2 are used to generate y*.
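Because each entry of Table 1 is a rejection frequency over 500 replications, it carries Monte Carlo sampling error. As a rough guide, and assuming the replications are independent, the binomial standard error of an entry p is sqrt(p(1 - p)/500), which is about .02 for entries near .5. The short sketch below computes it for two entries of the table; the function name is illustrative.

```python
import math

def mc_standard_error(power, reps=500):
    """Binomial standard error of a rejection frequency from `reps` replications."""
    return math.sqrt(power * (1.0 - power) / reps)

# Two Table 1 entries for m = 2, c = 2.5.
se_chow = mc_standard_error(0.51)
se_sampling = mc_standard_error(0.58)
print(round(se_chow, 4), round(se_sampling, 4))
```

If the tests were evaluated on the same simulated samples, their estimates are correlated, so the standard error of a difference between two rows cannot be obtained by simply combining these two numbers.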
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